Deep Learning for Traffic Scene Understanding: A Review

PARYA DOLATYABI 1, (Graduate Student Member, IEEE), JACOB REGAN 1, (Graduate Student Member, IEEE), AND MAHDI KHODAYAR, (Member, IEEE)

Department of Computer Science, University of Tulsa, Tulsa, OK 74104, USA

Corresponding author: Parya Dolatyabi (pad7492@utulsa.edu)

This work was supported in part by U.S. Department of Transportation (USDOT) under Grant 693JJ32350030.

INDEX TERMS Deep learning, traffic scene understanding, discriminative models, generative models, domain adaptation, classification, object detection, segmentation.

I. INTRODUCTION

The rapid evolution of deep learning (DL), particularly in computer vision, has initiated a new era of intelligent transportation systems. Researchers have made significant strides in advancing autonomous vehicles, traffic management, and pedestrian safety by fusing deep neural architectures with the complexities of traffic scenes. However, despite these advancements, critical challenges persist in effectively translating theoretical breakthroughs into robust, real-world applications, such as handling the variability of traffic environments, ensuring real-time processing, and achieving high accuracy under diverse conditions. This review aims to thoroughly explore these challenges by offering an in-depth analysis of the complex interactions between deep neural networks (DNNs), computer vision, and traffic scene understanding.

Previous studies have made substantial contributions to this field. However, they also exhibit certain limitations. For example, [1] provided an extensive survey on deep learning-based object detection in traffic scenarios, covering over 100 papers and highlighting challenges such as real-time performance, image quality degradation, and object occlusion. Autonomous driving technologies were investigated in [2], with a focus on DL methods for perception, mapping, and sensor fusion. They also pointed out limitations in multi-sensor integration and prediction accuracy. The authors of [3] covered deep learning techniques for object detection, semantic segmentation, instance segmentation, and lane line segmentation in autonomous driving. They highlighted key challenges such as high computational cost, real-time performance limitations, and occlusion issues, particularly in instance segmentation, where region proposal-based methods often struggle with small or occluded objects. Advanced methods, such as Adaptive Feature (AF) pooling, were suggested to improve efficiency in these scenarios. The study in [4] reviews methods based on artificial intelligence (AI), including convolutional neural networks (CNNs) and reinforcement learning, for tasks such as driving scene perception, path planning, and motion control. It discusses challenges including handling occlusion, particularly during scene perception, where occluded objects often hinder accurate detection and recognition. Despite their valuable insights, these studies face several critical shortcomings:

  1. Current review papers largely focus on discriminative models, with limited coverage of generative models crucial for synthetic data generation in traffic scenarios. Additionally, they offer limited discussion on domain adaptation (DA) techniques, which are essential for transferring models across different environmental conditions to enhance robustness in traffic scene analysis [1], [2], [3].
  2. Current surveys often lack a detailed discussion on hyperparameter optimization (HPO) [1], [2], [3], [4], which is crucial for fine-tuning DL models in complex traffic scenarios. This omission is significant, as HPO enhances model efficiency, reduces training time, and improves real-time deployment feasibility.
  3. Many existing survey papers fail to provide a thorough comparison of different deep learning architectures, especially among discriminative, generative, and domain adaptation categories [1], [4]. This omission is important, as such comparisons are essential for understanding the unique strengths and weaknesses of each approach, which plays a key role in making informed decisions about model selection and application.
  4. Existing literature lacks a comprehensive comparative discussion of various models, including their advantages, disadvantages, and potential areas for future research [1], [2], [3], [4]. Such an analysis is essential for understanding the trade-offs between different models and identifying opportunities for advancing state-of-the-art solutions in complex traffic environments.
  5. Current reviews often cover challenges like safety and hardware but neglect emerging areas such as Explainable AI (XAI) and real-time processing [2], [3]. These areas are essential for advancing traffic scene understanding and ensuring AI systems are transparent and effective in dynamic environments.
  6. Current studies are outdated, as they do not include a review of the most recent developments in the field, thereby missing the latest advancements shaping DL applications in traffic scene understanding [1], [2], [3], [4]. This is a significant limitation, as staying updated with the latest research is crucial for providing a comprehensive and forward-looking resource.

To address these gaps, our paper studies the core computer vision techniques of classification, object detection, and segmentation, while also extending its analysis to cover advanced topics including action recognition, object tracking, path prediction, anomaly detection, scene generation, and image enhancement. By synthesizing findings from a broad spectrum of studies, our paper provides a holistic overview of the evolution from traditional image processing methods to advanced DL models, including Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Domain Adaptation models. It emphasizes the integration of these models into real-world applications such as autonomous driving, traffic management, and pedestrian safety, while also addressing challenges including occlusions, dynamic urban traffic environments, and varying weather and lighting conditions. The contributions of our paper are:

  1. We present a categorized exploration of discriminative, generative, and DA models, including detailed analyses of YOLO (You Only Look Once) variants, Vision Transformers (ViTs), graph-based models, and various DA techniques. Our approach not only covers these models comprehensively but also emphasizes their advantages, such as improved accuracy in traffic scenarios with discriminative models, enhanced training through generative models for synthetic data creation, and increased robustness in real-world conditions with DA techniques.
  2. HPO strategies are discussed in our work, with specific sections dedicated to each category of DL architectures: discriminative, generative, and DA. This provides insights into optimizing these models for better performance, ensuring they are fine-tuned for optimal results in traffic scene understanding.
  3. Our review provides comparisons of different DL architectures across discriminative, generative, and DA categories, focusing on application frameworks, variance in datasets, performance metrics, and overall results, providing clear guidance on selecting the most effective models for traffic scene understanding.
  4. A comparative discussion of discriminative, generative, and DA models is provided in our work, highlighting their advantages, disadvantages, and potential directions for future research. This detailed analysis helps to clarify the strengths and limitations of each model, guiding future development and innovation in complex traffic scenarios.
  5. We highlight emerging research trends such as XAI, augmenting vision backbones with ViT and GNN, real-time traffic scene processing, overcoming data limitations using synthetic data, and enhancing perception via multi-modality and data fusion. This structured perspective is crucial for advancing traffic scene understanding and providing a thorough, future-focused review.
  6. We provide an up-to-date review of the latest papers and advancements, incorporating the most recent research insights. Keeping current with these developments is crucial for delivering a thorough and future-oriented resource.

These contributions make our paper a more thorough and forward-thinking resource, offering a critical assessment of current research, highlighting limitations, and presenting new perspectives on traffic scene understanding. By addressing the shortcomings of previous studies and offering a clear articulation of current challenges faced, this review aims to inspire future research and development efforts that drive innovation in deep learning for traffic scene understanding.

The rest of this paper is organized as follows. Section II reviews the most widely used datasets in traffic scene understanding. In Section III, we introduce the discriminative DL models, including CNNs, region-based CNN (R-CNN) variants, YOLO, ViT, DETR (Detection Transformer), graph-based models, and capsule networks (CapsNets). Section IV focuses on generative machine learning (ML) models, encompassing GANs, conditional GANs (cGANs), and variational autoencoders (VAEs). In Section V, we explore DA models within the categories of clustering-based, discrepancy-based, and adversarial-based approaches. A comparative discussion of these models is provided in Section VI. HPO techniques are detailed within each category. Finally, future research areas and concluding remarks are presented in Sections VII and VIII, respectively.

II. DATASETS

In this section, we focus on the most popular datasets identified in the papers reviewed in our study. These datasets are widely used across various tasks in traffic scene understanding, including object detection, segmentation, 3D tracking, classification, and domain adaptation. They have been selected based on their frequency of citation, versatility, and relevance to core applications. For less popular, niche, or highly customized datasets, readers are referred to the corresponding references cited in the respective works.

Table 1 summarizes 11 widely used datasets in traffic scene understanding, categorized based on their applications and characteristics. COCO 2017, VOC2007, and Cityscapes are benchmarks for object detection and segmentation, offering extensive annotations for diverse object categories and urban scenes. KITTI and nuScenes focus on 3D object detection and multi-object tracking, with KITTI emphasizing structured environments and nuScenes extending to radar data and more dynamic scenarios. GTSRB specializes in traffic sign recognition, providing a targeted dataset for autonomous driving systems. 1043-syn is a synthetic dataset optimized for traffic scene classification, particularly under controlled lighting and object variation scenarios. For person re-identification, DukeMTMC-ReID is a key benchmark, supporting identity matching tasks across multi-camera setups.

These datasets vary in scale and context, with real-world datasets like BDD, Mapillary, and Cityscapes capturing diverse weather and geographic conditions, while synthetic datasets like SYNTHIA and 1043-syn simulate controlled scenarios for domain adaptation and classification. Large-scale datasets such as COCO 2017 and BDD provide extensive data for deep learning, whereas smaller datasets like VOC2007 and KITTI offer high-quality annotations for specific tasks. This combination of real-world variability, geographic diversity, and synthetic precision allows researchers to address multifaceted challenges in traffic scene understanding, leveraging the strengths of each dataset for robust model development.

III. DISCRIMINATIVE DL MODELS

Discriminative DL models, often based on CNNs, are crucial for understanding complex traffic scenes. They excel at distinguishing objects and patterns, enabling tasks like object detection, classification, and segmentation. In traffic contexts, these models accurately identify vehicles, pedestrians, and road signs, enhancing real-time analysis in video feeds. By leveraging discriminative DL, systems improve road safety and efficiency, assisting autonomous navigation, traffic flow analysis, and pedestrian behavior prediction. This advancement supports intelligent transportation systems and enhances overall road safety.

In the following sections, we examine various discriminative DL models tailored for traffic scene understanding, including R-CNNs, YOLO variants, and attention mechanisms. These models address tasks like object detection, semantic segmentation, and action recognition, influencing intelligent transportation systems. We also discuss HPO for these architectures and compare performance metrics, providing a comprehensive overview.

A. CNN

A CNN [5] is a DL model designed for grid-like data (e.g., images), commonly used in traffic scene understanding to analyze camera images. CNNs automatically extract features like objects, signs, and road markings, supporting real-time processing and enhancing road safety for autonomous vehicles.

A basic CNN for image classification consists of convolution, pooling, fully connected (FC) layers, and an output layer. Convolution applies filters to generate feature maps, which are pooled and flattened before passing through the FC layers, with a softmax layer producing class probabilities.

The core CNN operation is convolution:

(I * K)(x, y) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} I(x+i,\, y+j)\, K(i, j), \tag{1}

where I is the input image, K is the kernel of size m × n, and (x, y) are the output coordinates.

TABLE 1. A summary of the most popular datasets identified in the papers reviewed in our study, categorized based on their characteristics and key features. Approximate numbers are used for dataset sizes to account for variations across versions, releases, or documentation. These datasets are widely adopted for diverse tasks such as object detection, segmentation, 3D tracking, classification, and domain adaptation, with applications spanning real-world scenarios and synthetic simulations. The inclusion of train and test sizes, along with geographic or virtual origins, highlights the diversity and specificity of these datasets in advancing traffic scene understanding. The term Varies in the Image Size column indicates datasets with images of multiple resolutions. Trainval refers to a combined set of training and validation images. Fine and Coarse denote levels of annotation granularity, with fine being pixel-accurate and coarse being approximate or less detailed.

| Dataset | Full Name | Application | Train Size | Test Size | Image Size (pixels) | Location |
| --- | --- | --- | --- | --- | --- | --- |
| COCO 2017 | Common Objects in Context | Object detection, segmentation, and image captioning. | 118,000 | 41,000 | Varies (e.g., 640 × 480 to 2048 × 1024) | Global |
| KITTI | Karlsruhe Institute of Technology and Toyota Technological Institute | 3D object detection, multi-object tracking. | 7,481 | 7,518 | 1242 × 375 | Karlsruhe, Germany |
| GTSRB | German Traffic Sign Recognition Benchmark | Traffic sign classification and recognition. | 39,209 | 12,630 | Varies (15 × 15 to 250 × 250) | Germany |
| VOC2007 | PASCAL Visual Object Classes 2007 | Object detection and classification. | 5,011 (trainval) | 4,952 | Varies (500 × 375) | Europe |
| Cityscapes | Cityscapes | Semantic segmentation of urban scenes. | 3,475 (fine), 20,000 (coarse) | 1,525 (fine) | 2048 × 1024 | Multiple cities in Germany |
| nuScenes | nuScenes | 3D object detection, multi-sensor tracking. | 28,130 | 6,008 | 1600 × 900 | Boston, USA; Singapore |
| BDD | Berkeley DeepDrive Dataset | Object detection, segmentation, classification. | 70,000 | 20,000 | 1280 × 720 | USA |
| Mapillary | Mapillary Dataset | Street-level semantic segmentation. | 20,000 | 5,000 | Varies | Global |
| SYNTHIA | Synthetic Images for Training | Synthetic data for segmentation, domain adaptation. | 8,000 | 1,400 | 960 × 720 | Virtual (synthetic) |
| DukeMTMC-ReID | Duke Multi-Target Multi-Camera Re-ID Dataset | Person re-identification. | 16,522 | 19,889 | 128 × 64 | Duke University, USA |
| 1043-syn | 1043 Synthetic Dataset | Synthetic dataset for classification, object recognition. | 8,000 | 2,000 | 640 × 480 | Virtual (synthetic) |

Next, a ReLU activation function introduces non-linearity: fReLU(x) = max(0, x). A pooling layer reduces spatial dimensions to decrease parameters and computation. Max pooling is defined as:

P(x, y) = \max_{0 \le i < p,\; 0 \le j < q} I(x+i,\, y+j), \tag{2}

where P is the pooled output and p × q is the pooling window size.

The FC layer performs classification, with output y = Wx + b, where W is the weight matrix, x the input, and b the bias. Finally, softmax converts logits to probabilities:

f_{\text{softmax}}(\zeta_i) = \frac{e^{\zeta_i}}{\sum_{j=1}^{n} e^{\zeta_j}}, \tag{3}

where ζi is the i-th output element and n is the number of classes.
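
To make Eqs. (1)-(3) concrete, the following minimal NumPy sketch chains a single convolution, ReLU, max-pooling, fully connected, and softmax step. The filter, weights, and layer sizes are arbitrary illustrations, not taken from any architecture in the surveyed papers.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation form), as in Eq. (1)."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(image[x:x + m, y:y + n] * kernel)
    return out

def relu(x):
    """Element-wise ReLU non-linearity: max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(feature, p=2, q=2):
    """Non-overlapping p x q max pooling, as in Eq. (2)."""
    H, W = feature.shape
    H2, W2 = H // p, W // q
    return feature[:H2 * p, :W2 * q].reshape(H2, p, W2, q).max(axis=(1, 3))

def softmax(z):
    """Eq. (3); subtracting the max is a standard numerical-stability trick."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy forward pass: a random 32x32 "image", one 3x3 filter, 10 classes.
rng = np.random.default_rng(0)
image, kernel = rng.random((32, 32)), rng.random((3, 3))
W_fc, b_fc = rng.random((10, 15 * 15)), np.zeros(10)

features = max_pool(relu(conv2d(image, kernel)))   # 30x30 -> 15x15
logits = W_fc @ features.flatten() + b_fc          # y = Wx + b
probs = softmax(logits)                            # class probabilities
```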

Some applications of CNNs in traffic scene understanding include early implementations of recognizing traffic signs with 99% accuracy, though the recognition time was relatively long for real-time applications [6]. Modern adaptations have surpassed human performance in tasks like traffic sign recognition [7]. CNNs also enhance free-space detection through data fusion techniques [8] and improve recognition of traffic police gestures [9].

Shortly after the introduction of CNNs in the 1980s, [6] applied fractal texture segmentation for traffic sign detection using a receptive field neural network (NN). The network had an input layer of 32×32 neurons, an output layer with ten neurons, and four hidden layers of 16×16, 8×8, 4×4, and 30 neurons. It was trained to recognize 9 types of traffic signs from images at 1, 2, and 3 meters, achieving 99% accuracy. However, the recognition time of 4 seconds is too long for real-time use. The "RFNN_TSR" dataset includes nine road signs for landmark recognition in outdoor settings at varying distances.

In [7], the traditional CNN architecture was modified to incorporate multi-scale features, achieving an accuracy of 99.17% on the GTSRB dataset, surpassing human performance (98.81%). Initially, using 32×32 color images, the model achieved 98.97% accuracy, with even randomly generated features yielding a competitive 97.33%.

SNE-RoadSeg [8] integrates surface normal estimation (SNE) with a data-fusion CNN architecture for enhanced free-space detection, showcasing a unique dual-encoder system that merges RGB and surface normal information. This fusion, along with densely-connected skip connections in the decoder, enables precise segmentation. On the KITTI benchmark, it achieves an average precision (AP) of 94.07%.

A novel approach to traffic police gesture recognition is proposed in [9], combining a modified Convolutional Pose Machine (CPM) with a Long Short-Term Memory (LSTM) for temporal feature extraction. Enhanced by handcrafted features like Relative Bone Length and Angle with Gravity, it achieves 91.18% accuracy on the TPGR dataset.

B. R-CNN

In this section, we delve into the R-CNN family of models, which build upon the strengths of CNNs by introducing region-based detection for improved precision in complex scenarios such as traffic monitoring. We will explore the evolution of R-CNN models, beginning with Vanilla R-CNN and progressing through Fast R-CNN, Faster R-CNN, and Mask R-CNN, highlighting their advancements and contributions to traffic scene understanding.

FIGURE 1. Vanilla R-CNN workflow for object detection in a traffic scene: The process starts with identifying a set of proposed regions that could contain objects. Each proposed region is then passed through a pre-trained CNN to extract features, followed by classification using class-specific SVMs. Finally, bounding boxes are refined to enhance localization accuracy. This workflow demonstrates the ability to accurately detect and classify objects such as cars, poles, and trees, achieving precise object localization and high reliability in real-time traffic monitoring applications.

1) VANILLA R-CNN

Vanilla R-CNN [10] extends traditional CNNs by using region proposals and pretrained CNNs for object detection. It generates region proposals to hypothesize object locations, processes each region through a CNN to extract feature vectors, and classifies these vectors with class-specific SVMs and bounding box regressors. This allows R-CNNs to manage object variability and achieve superior detection performance.

Figure 1 illustrates the Vanilla R-CNN process for object detection in a traffic scene. The first step is generating region proposals that may contain objects. If the image is denoted as I, the set of region proposals can be represented as R = {r1, r2, ..., rn}, where each ri is a bounding box.

Each region ri is passed through a pre-trained CNN to extract a feature vector Fi. The CNN maps the region ri to the feature vector Fi, represented as Fi = f(ri).

The extracted feature vectors are classified using class-specific linear SVMs. The score for region ri belonging to class j is Sij = fSVMj(Fi), where fSVMj is the SVM for class j.

Finally, the bounding box for each region proposal is refined using a bounding box regressor. This regressor, g, takes the feature vector Fi and the original bounding box ri as input and outputs a new bounding box bi, represented as bi = g(Fi, ri). The output of R-CNN is a set of bounding boxes with their corresponding class labels, where each bounding box is assigned to the class with the highest SVM score.
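
The detection loop described above can be summarized in a short sketch. The helper names below (propose_regions, cnn_features, svm_scores, refine_box) are hypothetical stand-ins for selective search, the pre-trained CNN, the class-specific SVMs, and the bounding-box regressor; the sketch only illustrates how the pieces compose.

```python
import numpy as np

def rcnn_detect(image, propose_regions, cnn_features, svm_scores, refine_box):
    """Sketch of the Vanilla R-CNN inference loop under the assumptions above.

    Each returned detection is (refined box, class index, SVM score).
    """
    detections = []
    for r_i in propose_regions(image):       # R = {r1, r2, ..., rn}
        F_i = cnn_features(image, r_i)       # Fi = f(ri)
        scores = svm_scores(F_i)             # Sij for every class j
        j_best = int(np.argmax(scores))      # keep the highest-scoring class
        b_i = refine_box(F_i, r_i)           # bi = g(Fi, ri)
        detections.append((b_i, j_best, float(scores[j_best])))
    return detections
```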

R-CNNs have greatly advanced object detection in traffic scenes, surpassing traditional CNNs with accuracy rates of up to 75.6% on the COCO dataset [11]. They excel in detecting pedestrians [12], vehicles [13], and traffic signs [14], [15]. Innovations like cascaded architectures [16], attention mechanisms [17], and hybrid approaches [18] further improve their performance. These advancements contribute to robust traffic scene understanding, aiding the development of automated driving systems [19].

R-CNN, introduced in [10], significantly improves object detection accuracy by combining region proposals with CNNs. It addresses occlusion by using region proposals to better localize objects, even when partially occluded, in contrast to sliding window methods like OverFeat.

In a comparative study, R-CNN outperformed traditional CNN on the COCO dataset, achieving 75.6% accuracy (N = 78) compared to 47.7% (N = 78) for CNN. The study, with a 70%/30% training-test split, found a significance p-value of 0.041 [11].

The authors of [12] focus on pedestrian detection, achieving a 23.3% miss rate on the Caltech dataset. They handle occlusion challenges by relying on the model's ability to learn from large datasets instead of explicit occlusion modeling, improving accuracy with more training data.

A method for traffic sign detection using Sparse R-CNN [20] is introduced in [18]. On the BCTSDB and TT-100K datasets, it achieved state-of-the-art performance, with AP50 and AP75 scores of 99.1% and 96.2% for BCTSDB, and 53.1% and 48.7% for TT-100K.

The Bagging R-CNN framework in [17] uses ensemble learning with adaptive sampling to improve object detection in complex traffic scenes. It achieved 58.7% AP and 83.0% AP50 using ResNet50, and 63.0% AP and 87.1% AP50 using the Swin-T [21] backbone.

Context R-CNN, as presented in [19], improves stationary surveillance by selecting and storing objects in memory banks by category. This approach enhanced recognition performance on the TJU-DHD-traffic and Pascal VOC datasets, increasing the mean average precision (mAP) by 0.37 compared to conventional methods.

Although Vanilla R-CNN is foundational, its two-stage process is slow and inefficient due to sequential processing of region proposals and classification. This results in high computational complexity and latency, making it unsuitable for real-time applications. Additionally, it struggles with small object detection and relies on selective search, leading to redundant regions, while requiring significant memory and complex training procedures, limiting its practical use.

FIGURE 2. Fast R-CNN procedure for object detection in a traffic scene: The model processes the input image by first extracting features for the entire image using a deep convolutional neural network (Deep ConvNet). RoI projections are mapped onto this shared feature map to generate fixed-size feature vectors using RoI pooling. The resulting RoI feature vectors are passed through fully connected layers to produce two outputs: class probabilities (using a softmax layer for classification) and bounding box regression to refine bounding box coordinates. This enables accurate object detection, such as identifying the “Police Car” class and refining the bounding box parameters in the traffic scene, making it suitable for real-time applications.

2) FAST R-CNN

Fast R-CNN, introduced in [22], improves R-CNN by using a single forward pass of the image through a CNN to extract feature maps. It classifies object proposals and refines their spatial locations directly from shared feature maps, significantly improving training and testing speed. Fast R-CNN trains the VGG16 network [23] nine times faster and tests 213 times quicker than R-CNN, while achieving higher mAP on the PASCAL VOC 2012 dataset and surpassing SPPnet [24] in accuracy.

Figure 2 shows the Fast R-CNN approach for traffic scene object detection. Unlike R-CNN, Fast R-CNN uses a single deep CNN, denoted as fCNN, to extract features from the entire image once, producing a feature map F = fCNN(I).

Region proposals are generated and mapped onto the shared feature map F. Each region proposal ri is converted to a fixed-size feature vector \( \mathcal{F}_{r_i} = f_{\text{RoI\_Pooling}}(\mathcal{F}, r_i) \) using RoI pooling.

Fast R-CNN uses a softmax layer for classification, unlike R-CNN’s SVMs. The probability Pij that region ri belongs to class j is:

P_{ij} = f_{\text{Softmax}}\left( \mathcal{F}_{r_i} \right)_j, \tag{4}

where j ranges from 1 to C. For bounding box regression, Fast R-CNN refines the bounding box using predicted offsets δx, δy, δw, δh, calculated as:

b_i = f_{\text{BBox\_Refine}}\left( r_i, \delta x, \delta y, \delta w, \delta h \right). \tag{5}

The final output is a set of refined bounding boxes with class labels. Each bounding box is associated with the highest probability from the softmax layer.
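
The sketch below illustrates, under simplifying assumptions, the two ideas that distinguish Fast R-CNN at inference time: a shared feature map that is pooled per RoI into a fixed-size vector, and a joint head that outputs softmax class probabilities (Eq. 4) and bounding-box offsets (Eq. 5). The quantized cropping, single feature channel, and randomly initialized heads are illustrative only.

```python
import numpy as np

def roi_pool(feature_map, roi, out_h=2, out_w=2):
    """Quantized RoI pooling: crop the RoI from the shared feature map and
    max-pool it into a fixed out_h x out_w grid (single channel, simplified)."""
    x0, y0, x1, y1 = roi                                   # integer feature-map coords
    crop = feature_map[y0:y1, x0:x1]
    h_bins = np.array_split(np.arange(crop.shape[0]), out_h)
    w_bins = np.array_split(np.arange(crop.shape[1]), out_w)
    return np.array([[crop[np.ix_(hb, wb)].max() for wb in w_bins]
                     for hb in h_bins])

def detection_heads(roi_vec, W_cls, W_bbox):
    """RoI feature vector -> softmax class probabilities (Eq. 4) and
    bounding-box offsets (dx, dy, dw, dh) used in Eq. (5)."""
    logits = W_cls @ roi_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, W_bbox @ roi_vec

# Toy example: one 16x16 shared feature map, one RoI, random head weights.
rng = np.random.default_rng(1)
fmap = rng.random((16, 16))
vec = roi_pool(fmap, (2, 3, 10, 12)).flatten()             # fixed-size RoI vector
probs, deltas = detection_heads(vec, rng.random((3, vec.size)),
                                rng.random((4, vec.size)))
```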

Fast R-CNN has been successfully applied in various traffic scene tasks, enhancing detection and classification. It detects road surface signs [25], counts and identifies vehicles in challenging scenarios [26], and improves monitoring at intersections [27]. The technology also enables simultaneous detection of pedestrians and cyclists, excelling on urban datasets [28]. Additionally, it has been adapted for event-based vehicle classification and counting, demonstrating its versatility in dynamic traffic environments [29].

The authors of [28] propose a unified method for concurrent detection of pedestrians and cyclists using a novel UB-MPR detection proposal and a Fast R-CNN-based model. Tested on the Tsinghua-Daimler dataset, it achieves a recall rate of 96.5% at an IoU threshold of 0.5. The method addresses occlusion effectively by focusing on upper body detection, ensuring accurate detection even with partial occlusion.

In [30], a framework combining deformable part models (DPMs) with CNNs and region proposal networks (RPNs) accelerates Fast R-CNN. Tested on the KITTI car benchmark, it shows 70% overlap for true positives across the Easy, Moderate, and Hard settings, with up to 12x speed improvement and comparable precision, especially in PASCAL VOC and KITTI assessments.

In [27], a road user monitoring system for intersections is presented, combining a GMM-based DL approach with geometric warping. Integrated with Fast R-CNN, it processes frames 0.92 s and 0.99 s faster on the MIT and Jinan datasets, respectively, with fewer misses in tracking and classification.

The study in [31] proposes a joint detection framework for pedestrians and cyclists using Fast R-CNN, incorporating techniques like difficult case extraction, multi-layer feature fusion, and shared convolution layers. This deeper architecture outperformed its counterpart, achieving 4.3% and 5.6% higher accuracy in pedestrian and cyclist detection, respectively.

In [29], an event-based object detection system using Fast R-CNN with hyperparameter optimization on modified Stanford car and Myanmar cars datasets is introduced. It achieves accurate vehicle classification and counting in real-time event video streaming, with improved accuracy for weddings and precise learning rate assessments on the Myanmar Cars dataset.

Fast R-CNN is used in [32] (referred to as "AllLightR-CNN" in our work) to detect moving vehicles in various conditions, such as low light, long shadows, cloudy weather, and dense traffic. It achieves an average computation time of 0.59 seconds with high detection rates, including 98.44% recall, 94.20% accuracy, and 90% precision in both day and night modes. The dataset ("AllLightRCNN_DS" in our work) includes 3975 frames from four YouTube videos, annotated for vehicle detection and classification, featuring occlusion and varying times of day and weather conditions.

Fast R-CNN improves speed over R-CNN by processing the entire image in a single pass but still relies on time-consuming region proposals, limiting real-time performance. While RoI pooling speeds up processing, it can introduce quantization errors, reducing accuracy, especially for small objects. Additionally, relying on external region proposal methods like Selective Search hinders real-time capabilities. Fast R-CNN also requires large labeled datasets and struggles with high-resolution images, where detailed feature extraction is critical.

3) FASTER R-CNN

Faster R-CNN, first introduced in [33], improves upon its predecessors, R-CNN and Fast R-CNN, by incorporating a Region Proposal Network (RPN). This RPN shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.

Figure 3 illustrates the use of Faster R-CNN for object detection in a traffic scene. Faster R-CNN comprises two main modules: a deep fully convolutional network for proposing regions and a Fast R-CNN detector that classifies these regions. Together, these modules form a unified network for efficient object detection in complex environments like traffic scenes.

Given an input image, the first step in Faster R-CNN is to pass it through several convolutional and max pooling layers to produce a shared feature map. If the input image is denoted as X, the convolution operation can be represented as:

F = f_{\text{conv}}(X), \tag{6}

where F is the resulting feature map.

The RPN takes the shared feature map F and outputs a set of rectangular object proposals, each with an objectness score. The RPN is fully convolutional and simultaneously predicts multiple region proposals at each location. If fRPN(F) represents the RPN operation, the output Y can be represented as:

Y = \{ (P_i, s_i) \mid i = 1, \ldots, N \}, \tag{7}

where Pi is the i-th proposed region and si is the corresponding score.

The proposed regions are reshaped using an RoI pooling layer to provide a fixed-size input to the FC layers. This process is the same in both Fast R-CNN and Faster R-CNN, with the key difference being the source of the regions: external algorithms in Fast R-CNN and an internal RPN in Faster R-CNN. The reshaping by RoI pooling can be mathematically expressed as:

R_i = f_{\text{RoI}}(P_i, F), \tag{8}

where Ri represents the reshaped region derived from each proposed region Pi and the feature map F.

Finally, the reshaped regions are fed into a sequence of FC layers that output the class probabilities and bounding box coordinates. If fc(Ri) represents the classification operation, the overall output of Faster R-CNN can thus be represented as follows:

Y = f_c\left( f_{\text{RoI}}\left( f_{\text{RPN}}(F), F \right) \right). \tag{9}
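
As a rough illustration of Eqs. (7)-(9), the toy RPN below places a few square anchors at every feature-map location, scores each with a placeholder objectness function, and keeps the top-scoring proposals. Real RPNs use a learned convolutional objectness head, multiple anchor aspect ratios, box regression, and NMS, none of which are modeled here.

```python
import numpy as np

def rpn_proposals(feature_map, objectness, anchor_sizes=(4, 8, 16), top_n=5):
    """Toy RPN: square anchors at every feature-map location, each scored by a
    placeholder `objectness` function; returns the top-N (Pi, si) pairs of Eq. (7)."""
    H, W = feature_map.shape
    proposals = []
    for y in range(H):
        for x in range(W):
            for size in anchor_sizes:
                box = (x - size // 2, y - size // 2, x + size // 2, y + size // 2)
                proposals.append((box, float(objectness(feature_map, box))))
    proposals.sort(key=lambda p: p[1], reverse=True)
    return proposals[:top_n]

# Toy usage: a random shared feature map and a dummy objectness score.
rng = np.random.default_rng(2)
F = rng.random((8, 8))
top_proposals = rpn_proposals(F, lambda fm, box: rng.random())
```

In the full model, these proposals feed the same RoI pooling and classification head sketched earlier for Fast R-CNN, which is exactly the composition written in Eq. (9).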

Faster R-CNN has advanced traffic scene understanding across several applications, improving environmental perception [34], optimizing traffic sign detection [35], and enhancing recognition of police gestures [36]. It has also refined pedestrian detection [37], boosted traffic surveillance [38], and enabled accurate vehicle categorization in traffic surveys [39]. These advancements improve performance in diverse conditions, aiding autonomous driving [40] and supporting real-time traffic analysis in smart cities [41].

An improved Faster R-CNN for small object detection is proposed in [42], specifically targeting small traffic signs in the TT100K dataset. It achieves a recall rate of 90% and an accuracy rate of 87%. The method addresses occlusion by using multi-scale convolution feature fusion and improved non-maximum suppression (NMS), enhancing the detection of small and partially occluded objects.

In [39], Faster R-CNN is used for vehicle detection in traffic surveys, chosen over SSD [43] and YOLO [44] for its higher accuracy despite slower speed. The authors highlight the advantages of DL over traditional methods, achieving over 87% accuracy in vehicle detection, queue length estimation, and vehicle type classification, even in untrained environments.

FIGURE 3. Faster R-CNN workflow for object detection in a traffic scene: The process starts by passing the image through several convolutional layers to generate a shared feature map. The feature maps are then processed by a region proposal network, which produces a set of region proposals with corresponding objectness scores. These proposed regions are reshaped using RoI pooling to ensure a consistent input size for the fully connected layers. Finally, the reshaped regions are classified into specific object categories, such as sign poles and police cars, and adjusted for accurate bounding-box localization, resulting in precise detection of various elements in the traffic scene.

In [45], Faster R-CNN is enhanced for object detection using hard negative sample mining and a two-channel feature network. By treating complex multi-classification tasks as binary classification, the modified approach achieves a 5% accuracy improvement on the KITTI dataset.

The authors of [40] propose an enhanced Faster R-CNN for traffic sign detection, incorporating feature pyramid fusion, deformable convolution, and ROI Align. Tested under various conditions, it achieved mAP scores of 92.6% in sunny weather, 90.6% at sunset, and 86.9% on rainy days, outperforming SSD [43], YOLOv2 [46], YOLOv3 [47], and YOLOv5 [48] in low-light and rainy conditions, proving effective for autonomous driving.

In [49], an enhanced Faster R-CNN with ResNet50-D, an attention-guided context feature pyramid network (ACFPN), and AutoAugment technology is proposed for traffic sign detection. Benchmarking against methods like SSD [43] and YOLOv3 [47], it achieved 29.8 FPS and 99.5% mAP on the CCTSDB dataset, surpassing other state-of-the-art methods, with competitive results on the TT100K dataset.

In [50], a correlation model analyzes haze's impact on traffic sign detection and sight distance, using a synthesized GTSDB dataset. The Faster R-CNN model, post-dehazing, achieved 95.11% detection accuracy. Results show that haze intensity inversely affects sight distance and detection, with accuracies of over 93% at 300 meters in light haze, 88%-93% at 100 meters in haze, and 85%-88% at 50 meters in dense haze.

In [41], the authors discuss the Intelligent Transportation System (ITS)-oriented Information Acquisition Models (IAMs), using the Mirror Traffic dataset and Internet of Things (IoT) to predict traffic conditions and adjust signals in real-time. By comparing Faster R-CNN to R-CNN in a DL context, they found that Faster R-CNN, with an 85.10% recall and 86.79% accuracy, outperforms R-CNN by 6.20%.

Faster R-CNN enhances speed by integrating region proposal generation through an RPN and improves occlusion handling, enabling better detection of partially obscured or overlapping objects. Despite these advances, it still faces challenges with real-time processing due to the computational demands of the RPN, high memory usage, and the need for extensive training data. The model struggles with small, overlapping, or occluded objects, and its complexity makes implementation and tuning difficult, limiting its use in data-scarce scenarios.

4) MASK R-CNN

Mask R-CNN [51] extends Faster R-CNN by adding a branch for object masks alongside class labels and bounding-box offsets. It achieves fine spatial layout extraction via pixel-to-pixel alignment, addressing limitations in Fast and Faster R-CNN. Retaining a two-stage approach, the first stage uses an RPN to propose regions, while the second stage predicts classes, bounding-box offsets, and binary masks for each RoI. RoI Align ensures precise feature alignment, improving segmentation accuracy. By predicting classification, regression, and segmentation in parallel, Mask R-CNN streamlines the multi-stage pipeline, offering a powerful solution for segmentation tasks.

Figure 4 shows Mask R-CNN applied to instance segmentation in a traffic scene. Key components unique to Mask R-CNN are highlighted, excluding those shared with Faster R-CNN (e.g., the backbone, RPN, and bounding-box regression). The focus is on its distinctive elements: the RoI Align operation and the mask prediction process.

After the RPN identifies potential object bounding box locations in the image, RoI Align warps features from the feature map to a fixed-size representation for each RoI without quantization:

x' = x \times \frac{w_{\text{RoI}}}{w_{\text{pooled}}}, \qquad y' = y \times \frac{h_{\text{RoI}}}{h_{\text{pooled}}}, \tag{10}

FIGURE 4. Mask R-CNN procedure for instance segmentation in a traffic scene: The input image is first processed through an RPN to identify regions of interest. These regions are then refined using the RoI Align operation, which ensures precise feature extraction by avoiding quantization effects, leading to more accurate segmentation. The refined features are passed through fully connected layers for class prediction and bounding box regression. Subsequently, the mask prediction process generates detailed binary segmentation masks for each instance using convolutional layers, producing accurate pixel-level masks. This approach provides high-resolution masks for various objects within the traffic scene, such as vehicles, pedestrians, and traffic signs, enabling precise instance segmentation.

where \( w_{\text{RoI}} \) and \( h_{\text{RoI}} \) represent the width and height of the RoI, and \( w_{\text{pooled}} \) and \( h_{\text{pooled}} \) represent the width and height after RoI pooling, respectively.

其中wROIhROI分别表示RoI的宽度和高度,wpooled hpooled 分别表示RoI池化后的宽度和高度。

The mask prediction process uses a Mask Head that generates binary masks for each RoI. In Equation (11), \( f_{\text{Conv\_2D}}(\cdot) \) is the 2D convolution operation on the input features. In Equation (12), \( \zeta_{\text{Mask}} \) denotes the raw mask predictions before the sigmoid function, and \( P(\zeta_{\text{Mask}}) \) gives the final pixel-level mask probabilities within the RoI:

\zeta_{\text{Mask}} = f_{\text{Conv\_2D}}\left( x \times \frac{w_{\text{RoI}}}{w_{\text{pooled}}},\; y \times \frac{h_{\text{RoI}}}{h_{\text{pooled}}} \right) \tag{11}

P(\zeta_{\text{Mask}}) = \sigma(\zeta_{\text{Mask}}). \tag{12}
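
The sketch below contrasts RoI Align with quantized RoI pooling by sampling the feature map at the fractional coordinates of Eq. (10) via bilinear interpolation, then applying a toy one-parameter "mask head" (a stand-in for the convolution in Eq. (11)) followed by a sigmoid (Eq. (12)). The RoI values and weights are arbitrary illustrations.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a single-channel feature map at continuous (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[y0, x0] + dx * (1 - dy) * fmap[y0, x1] +
            (1 - dx) * dy * fmap[y1, x0] + dx * dy * fmap[y1, x1])

def roi_align(fmap, roi, out=4):
    """RoI Align: sample an out x out grid at the fractional coordinates of
    Eq. (10) instead of rounding them as RoI pooling would."""
    x0, y0, w_roi, h_roi = roi
    grid = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            x = x0 + (j + 0.5) * w_roi / out       # x' = x * w_RoI / w_pooled
            y = y0 + (i + 0.5) * h_roi / out       # y' = y * h_RoI / h_pooled
            grid[i, j] = bilinear_sample(fmap, x, y)
    return grid

def mask_probs(aligned, w=1.0, b=0.0):
    """Toy mask head: a 1x1 'convolution' (scale + bias, standing in for
    Eq. (11)) followed by the sigmoid of Eq. (12)."""
    return 1.0 / (1.0 + np.exp(-(w * aligned + b)))

rng = np.random.default_rng(3)
fmap = rng.random((16, 16))
aligned = roi_align(fmap, (2.3, 4.7, 6.0, 5.0))    # fractional RoI, no quantization
mask = mask_probs(aligned)                         # per-pixel mask probabilities
```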

Mask R-CNN has been applied to various tasks, including floodwater detection on roads [52], traffic sign detection and recognition [53], and train safety through improved obstacle identification [54]. It also supports urban traffic management via vehicle contour detection and tracking [55], and enables accurate vehicle counting to manage congestion [56]. Comparative studies confirm its superior performance in vehicle detection and classification compared to other models [57].

In [52], a Mask R-CNN-based method for floodwater detection achieves 99.2% classification accuracy and 93.0% segmentation precision on the IDRF dataset [58], outperforming a prior approach [59]. For traffic sign detection and recognition, [53] uses a two-phase method with Mask R-CNN for shape-based detection and Xception [60] for classification on 11,074 Taiwanese traffic signs, achieving 98.45% precision for triangular and 99.73% for circular signs, surpassing YOLOv5 [61].

The ME Mask R-CNN method [54] improves automated train safety by integrating SSwin-Le Transformer, ME-PAPN, and multiscale enhancements, achieving a 91.3% mAP on the TrainObstacle dataset, 11.1% higher than Mask R-CNN, and an average detection rate of 4.2 FPS. It improves small-target detection by 19.35%, though gains for large and occluded targets are limited due to dataset characteristics.

A comprehensive comparison [57] of Faster R-CNN, Mask R-CNN, and ResNet-50 (R-CNN) on the 3,200-image RCNNs_Detection dataset (cars and jeeps from Kaggle) shows that Faster R-CNN and Mask R-CNN exceed 80% detection accuracy, while ResNet-50 achieves over 75%. This demonstrates their effectiveness in vehicle detection, classification, and counting.

Mask R-CNN extends Faster R-CNN with a mask prediction branch, enabling instance segmentation and facilitating the detection of occluded objects by distinguishing overlapping instances. However, this extra branch increases computational and memory demands, complicating real-time use and deployment on resource-constrained devices. The model also requires substantial labeled data, longer training times, and can struggle with small objects and complex scenes. Its increased complexity makes implementation, tuning, and debugging more challenging, particularly for custom applications.

C. YOLO

YOLO is a fast and efficient real-time object detection system that predicts detections in a single pass, unlike R-CNN methods relying on region proposals. Introduced in [44], it treats detection as a regression problem, dividing the image into a grid where each cell predicts bounding boxes, confidence scores, and class probabilities. YOLO generalizes well across domains but struggles with precise localization, especially for small objects. Fast YOLO, the then-fastest general-purpose detector, is also introduced.

FIGURE 5. YOLO object detection in a traffic scene featuring a police car: The input image is initially divided into an SxS grid, with each cell predicting bounding boxes, confidence scores, and class probabilities. This process culminates in a final detection display that accurately identifies and localizes the police car and other objects within the scene. A comprehensive legend highlights key objects of interest, providing a clear and detailed overview of the detected items. This real-time object detection approach combines speed and accuracy, making it highly effective for dynamic traffic monitoring applications.

图5. YOLO在包含警车的交通场景中的目标检测:输入图像首先被划分为SxS的网格,每个网格单元预测边界框、置信度分数和类别概率。该过程最终生成一个准确识别并定位警车及场景中其他物体的检测结果显示。详尽的图例突出显示关键目标,提供清晰且详细的检测物体概览。这种实时目标检测方法兼具速度与准确性,非常适合动态交通监控应用。

Figure 5 illustrates YOLO applied to a traffic scene. An input image I is divided into an S×S grid. Each grid cell is responsible for detecting objects whose centers fall within it. A convolutional network processes the image and outputs a tensor of shape S×S×(B×5+C), where B is the number of bounding boxes per cell and C is the number of classes.

图5展示了YOLO在交通场景中的应用。输入图像I被划分为一个S×S网格。每个网格单元检测其中心落在其中的物体。卷积网络处理图像并输出形状为S×S×(B×5+C)的张量,其中B是每个单元的边界框数量,C是类别数。

Bounding boxes are defined by (x, y, w, h, s), where (x, y) are the center coordinates relative to the grid cell, w and h are relative to the image size, and s is the confidence score, indicating the likelihood and accuracy of the box containing an object.

边界框由(x,y,w,h,s)定义,其中(x,y)是相对于网格单元的中心坐标,wh相对于图像尺寸,s是置信度分数,表示该框包含物体的可能性和准确性。

Each grid cell predicts C conditional class probabilities, P(c_i | ô), for the detected object ô belonging to class c_i. The final prediction combines these probabilities with the confidence score:

每个网格单元预测C个条件类别概率,P(cio^),针对检测到的物体o^属于类别ci。最终预测将这些概率与置信度分数结合:

P(c_i) = s × P(c_i | ô)    (13)

YOLO applies Non-Maximum Suppression (NMS) to remove redundant overlapping boxes. Predictions are sorted by confidence score, and overlapping boxes (e.g., IoU > 0.5) with lower scores are removed.

YOLO应用非极大值抑制(NMS)以去除冗余重叠框。预测结果按置信度排序,重叠框(如IoU > 0.5)中置信度较低的被移除。

YOLO's operation is summarized as o = f_NMS(f_Conv(I)), where o is the final prediction, f_Conv represents the convolutional layers, and f_NMS applies NMS.

YOLO的操作可总结为o=fNMS(fConv(I)),其中o是最终预测,fConv 代表卷积层,fNMS执行NMS。
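To make the grid decoding and NMS steps above concrete, the following NumPy sketch shows how an S×S×(B×5+C) output tensor could be turned into final detections. It is an illustrative example, not the reference implementation of [44]; the function names (decode_grid, nms), the thresholds, and the random tensor standing in for network output are assumptions.

import numpy as np

def decode_grid(output, B, C, conf_threshold=0.25):
    """Turn an S x S x (B*5 + C) tensor into a list of (score, class_id, box) detections."""
    S = output.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:B * 5 + C]          # C conditional class probabilities
            for b in range(B):
                x, y, w, h, s = cell[b * 5:(b + 1) * 5]
                scores = s * class_probs                  # Eq. (13): P(c_i) = s * P(c_i | o-hat)
                c = int(np.argmax(scores))
                if scores[c] < conf_threshold:
                    continue
                cx, cy = (col + x) / S, (row + y) / S     # cell-relative center -> image-relative
                detections.append((float(scores[c]), c,
                                   (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)))
    return detections

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop same-class boxes with IoU > threshold."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for det in detections:
        if all(det[1] != k[1] or iou(det[2], k[2]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Random values stand in for network output: 7x7 grid, 2 boxes per cell, 20 classes.
final_detections = nms(decode_grid(np.random.rand(7, 7, 2 * 5 + 20), B=2, C=20))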

YOLO has been adapted for traffic flow counting [62], traffic light detection [63], and traffic sign recognition [64], [65]. It effectively detects pedestrians and vehicles [66] and operates under varied lighting and weather conditions [67], [68]. Continuous improvements enable better small-target detection and high-resolution video processing [69], [70]. Additionally, YOLO versions support license plate identification [71] and have been compared across models for traffic sign detection and vehicle classification in challenging environments [72], [73].

YOLO已被改编用于交通流量计数[62]、交通信号灯检测[63]和交通标志识别[64],[65]。它能有效检测行人和车辆[66],并能在不同光照和天气条件下运行[67],[68]。持续改进提升了小目标检测和高分辨率视频处理能力[69],[70]。此外,YOLO版本支持车牌识别[71],并在复杂环境下对交通标志检测和车辆分类进行了模型比较[72],[73]。

YOLO has seen numerous enhancements and iterations since its inception. The following presents the primary YOLO versions, accompanied by historical insights and a review of relevant scholarly literature.

自诞生以来,YOLO经历了众多改进和迭代。以下介绍主要的YOLO版本,附带历史背景和相关学术文献综述。

1) YOLOV1 (YOLO)

1) YOLOV1(YOLO)

Debuted in 2016, this original YOLO model was ground-breaking as it treated object detection as a singular regression problem, enabling it to predict bounding box coordinates and class probabilities from an image in a single pass [44].

2016年首次亮相,该原始YOLO模型开创性地将目标检测视为单一回归问题,使其能够在一次前向传播中预测边界框坐标和类别概率[44]。

The importance of Traffic Light Detection (TLD) for intelligent vehicles and Driving Assistance Systems is highlighted in [63], which applies YOLO to the daySequence1 from the LISA Traffic Light Dataset. The study achieves a 90.49% AUC, a 50.32% improvement over the previous best using Aggregated Channel Features (ACF), and a 58.3% AUC comparable to the ACF configuration. This underscores TLD's critical role in enhancing self-driving car functionality.

文献[63]强调了交通信号灯检测(TLD)对智能车辆和驾驶辅助系统的重要性,应用YOLO于LISA交通信号灯数据集的daySequence1。研究实现了90.49%的AUC,比之前使用聚合通道特征(ACF)方法提升50.32%,且58.3%的AUC与ACF配置相当,凸显了TLD在提升自动驾驶功能中的关键作用。

2) YOLOv2

2) YOLOv2

Launched in 2017, YOLOv2 [46] introduced major improvements, including detection of over 9000 object categories, the "Darknet-19" architecture, anchor boxes for better bounding box prediction, and multi-scale training. By combining detection-labeled data from COCO with classification data from ImageNet [74], the authors enabled joint classification and detection, creating YOLO9000 [46], capable of detecting a vast range of categories.

2017年发布的YOLOv2[46]带来了重大改进,包括检测9000多个物体类别、采用“Darknet-19”架构、引入锚框以提升边界框预测,以及多尺度训练。通过结合COCO的检测标注数据和ImageNet的分类数据[74],作者实现了联合分类与检测,创造了YOLO9000[46],能够检测广泛类别。

An optimized pedestrian and vehicle detection algorithm based on YOLOv2 is introduced in [66], which improves accuracy while maintaining efficiency. Comparative results on the KITTI dataset demonstrate its real-time capability, outperforming Faster R-CNN and YOLO V2 with 45% accuracy for pedestrians and 61.34% for vehicles at 22 FPS.

文献[66]提出了一种基于YOLOv2的优化行人和车辆检测算法,在保持效率的同时提升准确率。KITTI数据集上的对比结果显示其具备实时能力,行人检测准确率达45%,车辆检测为61.34%,帧率为22 FPS,优于Faster R-CNN和YOLOv2。

3) YOLOv3

3) YOLOv3

Unveiled in 2018, YOLOv3 employed three distinct sizes of anchor boxes for predictions across three scales. It utilized a deeper architecture, "Darknet-53," and expanded its object category detection capabilities. Additionally, it adopted three different sizes of detection kernels (13×13,26×26,52×52) to identify objects of varying dimensions [47].

YOLOv3于2018年发布,采用三种不同尺寸的锚框(anchor boxes)在三个尺度上进行预测。它使用了更深的网络结构“Darknet-53”,并扩展了其目标类别检测能力。此外,还采用了三种不同尺寸的检测核(13×13,26×26,52×52)以识别不同大小的物体[47]。

A YOLOv3-based traffic sign recognition system introduced in [64] achieves a 92.2% mAP for detection on the GTSDB dataset and 99.6% classification accuracy on the GTSRB dataset using a CNN-based classifier. It outperforms Faster R-CNN in both detection accuracy and frame rate, processing images at approximately 9.87 FPS.

文献[64]中提出的基于YOLOv3的交通标志识别系统,在GTSDB数据集上的检测mAP达到92.2%,在GTSRB数据集上使用基于CNN的分类器实现了99.6%的分类准确率。该系统在检测准确率和帧率上均优于Faster R-CNN,处理图像速度约为9.87帧每秒。

4) YOLOv4

4) YOLOv4

Released in 2020, YOLOv4 incorporated numerous improvements over its predecessors. It integrated features such as the "CSPDarknet53-PANet-SPP" architecture, PANet, and SAM block. Additionally, it employed the Complete IoU (CIoU) loss and the Mish activation function, aiming to enhance both speed and accuracy [75].

YOLOv4于2020年发布,相较于前代版本进行了多项改进。它集成了“CSPDarknet53-PANet-SPP”架构、PANet和SAM模块。此外,采用了Complete IoU (CIoU) 损失函数和Mish激活函数,旨在提升速度和准确率[75]。

In [76], traffic sign detection and recognition for smart vehicles are explored using YOLOv4 and YOLOv4-tiny [75] integrated with Spatial Pyramid Pooling (SPP). Results show Yolo V4_1 (with SPP) achieving 99.4% accuracy and 99.32% mAP, while Yolov3 [47] SPP attains 98.99% mAP. These findings indicate that SPP enhances model performance.

文献[76]中,利用YOLOv4和YOLOv4-tiny[75]结合空间金字塔池化(Spatial Pyramid Pooling, SPP)技术,探讨了智能车辆的交通标志检测与识别。结果显示,带SPP的Yolo V4_1实现了99.4%的准确率和99.32%的mAP,而带SPP的Yolov3[47]则达到98.99%的mAP。这些结果表明SPP提升了模型性能。

To address environmental challenges like light intensity, extreme weather, and distance, TSR-YOLO [68], based on YOLOv4-tiny, incorporates Better-ECA (BECA), dense SPP networks, and k-means++ clustering for optimal prior boxes. On the CCTSDB2021 dataset,it achieves 96.62% accuracy, 79.73% recall,an 87.37% F1-score,and a 92.77% mAP, improving over YOLOv4-tiny while maintaining 81 FPS.

为应对光照强度、极端天气和距离等环境挑战,基于YOLOv4-tiny的TSR-YOLO [68]引入了改进的ECA(Better-ECA,BECA)、密集SPP网络和k-means++聚类以优化先验框。在CCTSDB2021数据集上,其准确率达96.62%,召回率79.73%,F1分数87.37%,mAP为92.77%,在保持81 FPS的同时优于YOLOv4-tiny。

A novel semi-automatic method, combining a modified YOLOv4 and background subtraction, is introduced in [77] for unsupervised object detection in surveillance videos. It significantly increases mAP and outperforms state-of-the-art results on the CDnet 2014 and UA-DETRAC datasets, achieving 97.4% precision compared to YOLOv3’s 89% and YOLOv4's 90.4% on the Street corner at night scenario.

文献[77]提出了一种结合改进YOLOv4和背景减除的半自动方法,用于监控视频中的无监督目标检测。该方法显著提升了mAP,并在CDnet 2014和UA-DETRAC数据集上超越了最先进的结果,在夜间街角场景中实现了97.4%的精度,优于YOLOv3的89%和YOLOv4的90.4%。

5) YOLOv5

5) YOLOv5

Introduced in 2020, YOLOv5 [61] was developed independently and is not an official continuation by the original YOLO creators; its naming therefore sparked controversy. It features a modified "CSPDarknet53" backbone with architectural optimizations that improve speed and real-world applicability.

YOLOv5 [61]于2020年推出,独立开发,非原YOLO作者的官方续作。其采用改进的“CSPDarknet53”主干网络,进行了架构优化以提升速度和实际应用性能。其命名引发争议,因为并非由原作者创建。

To improve vehicle detection in traffic surveillance videos, [69] proposes an enhanced YOLOv5s model with a small target detection layer and Atrous SPP (ASPP) for multi-scale context, achieving 93.7% precision, 94.2% recall, and 93.9% mAP@0.5—improvements of 0.8%, 1.9%, and 2.3% over the original YOLOv5s, reducing missed and false detections.

为提升交通监控视频中的车辆检测,文献[69]提出了增强版YOLOv5s模型,增加了小目标检测层和空洞空间金字塔池化(Atrous SPP,ASPP)以实现多尺度上下文感知,达到了93.7%的精度、94.2%的召回率和93.9%的mAP@0.5,分别较原YOLOv5s提升0.8%、1.9%和2.3%,减少了漏检和误检。

Ghost-YOLO [70] is a lightweight model for traffic sign detection using the C3Ghost module to replace YOLOv5's feature extraction. It achieves 92.71%mAP while reducing parameters by 91.4% and computations by 50.29%, balancing speed and accuracy for real-world use.

Ghost-YOLO [70]是一种用于交通标志检测的轻量级模型,采用C3Ghost模块替代YOLOv5的特征提取部分。其在实现92.71%mAP的同时,参数量减少了91.4%,计算量降低了50.29%,在速度和精度之间实现了良好平衡,适合实际应用。

6) YOLOv6

6) YOLOv6

Introduced in September 2022, YOLOv6 boasts an efficient design comprising a backbone built from RepVGG [78] or the newly introduced "CSPStackRep" blocks, a Path Aggregation Network (PAN) topology neck, and a decoupled head with a hybrid-channel strategy. It uses advanced quantization techniques, such as post-training quantization and channel-wise distillation, leading to swifter and more precise detectors [79].

YOLOv6于2022年9月发布,设计高效,包含采用RepVGG [78]或新引入的“CSPStackRep”模块的主干网络,路径聚合网络(PAN)结构的颈部,以及采用混合通道策略的解耦头。其使用了先进的量化技术,如训练后量化和通道级蒸馏,提升了检测器的速度和精度[79]。

A license plate identification algorithm based on the YOLOv6 convolution model is outlined in [71]. It achieves a 94.7% precision rate for plate localization, and the proposed BLPNET (VGG-19 + ResNet-50) model reaches a 100% F1-score in character recognition, leading to reduced costs and improved traffic management effectiveness.

文献[71]基于YOLOv6卷积模型提出了一种车牌识别算法,通过提高定位效率实现了94.7%的精度,并提出了BLPNET(结合VGG-19和RESNET-50)的字符识别模型,F1分数达到100%,从而降低了成本并提升了交通管理效果。

7) YOLOv7

7) YOLOv7

Released in July 2022, YOLOv7 set new object detection benchmarks, excelling in both speed (5-160 FPS) and accuracy. Trained solely on the MS COCO dataset without pre-trained backbones, it introduced architectural modifications and "bag-of-freebies" to boost accuracy without sacrificing inference speed, though training time increased [80].

YOLOv7于2022年7月发布,刷新了目标检测基准,在速度(5-160 FPS)和精度方面表现卓越。该模型仅在MS COCO数据集上训练,未使用预训练主干网络,采用了架构改进和“免费礼包”(bag-of-freebies)策略,在不牺牲推理速度的前提下提升了准确率,但训练时间有所增加[80]。

An enhanced YOLOv7-WCN network for traffic sign detection [81] improves accuracy from 85.5% to 89.0% by integrating Horblock modules with convolutional layers for efficient mapping, a normalization-based attention module (NAM), and replacing CIoU loss with Wasserstein distance loss [82].

针对交通标志检测,文献[81]提出了增强版YOLOv7-WCN网络,通过将Horblock模块与卷积层结合实现高效映射,采用基于归一化的注意力模块(NAM),并用Wasserstein距离损失替代CIoU损失,将准确率从85.5%提升至89.0%[82]。

8) YOLOv8

8) YOLOv8

Unveiled in January 2023, YOLOv8 [83] offers five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large), using a backbone similar to the one used in YOLOv5 but with some modifications to the cross-stage partial (CSP) layer. This iteration supports a wide range of computer vision tasks, including object detection, segmentation, pose estimation, tracking, and classification.

YOLOv8于2023年1月发布,提供五个不同规模版本:YOLOv8n(纳米)、YOLOv8s(小型)、YOLOv8m(中型)、YOLOv81(大型)和YOLOv8x(超大型),其主干网络类似于YOLOv5,但对跨阶段部分(CSP)层进行了部分修改。该版本支持广泛的计算机视觉任务,包括目标检测、分割、姿态估计、跟踪和分类。

To address road accidents caused by human error, [72] proposes a traffic sign detection method using YOLOv5s6 [61] and YOLOv8s [83]. Testing on TT100k, TWTS, and a hybrid dataset shows YOLOv8s outperforming YOLOv5s6, achieving an mAP@0.5 of 76.2% on the hybrid dataset compared to 65% for YOLOv5s6.

为解决由人为错误引发的交通事故,[72]提出了一种基于YOLOv5s6 [61]和YOLOv8s [83]的交通标志检测方法。在TT100k、TWTS及混合数据集上的测试表明,YOLOv8s优于YOLOv5s6,在混合数据集上实现了mAP@.5为76.2%,而YOLOv5s6为65%

Recent studies have compared object detection algorithms across diverse environments and datasets. Reference [84] evaluated Faster R-CNN, YOLOv3, and YOLOv4 for aerial car detection on Stanford and PSU datasets, highlighting the impact of dataset characteristics and parameters like input size and learning rate on accuracy. On the PSU dataset, YOLOv3 and YOLOv4 achieved AP scores of 0.965, outperforming Faster R-CNN (0.739).

近期研究比较了不同环境和数据集上的目标检测算法。文献[84]评估了Faster R-CNN、YOLOv3和YOLOv4在斯坦福和PSU数据集上的航空车辆检测,强调了数据集特性及输入尺寸、学习率等参数对准确率的影响。在PSU数据集上,YOLOv3和YOLOv4的AP得分均为0.965,优于Faster R-CNN的0.739。

A broader evaluation of SSD, Faster R-CNN, and YOLO versions (YOLOv8, YOLOv7, YOLOv6, YOLOv5) on nine datasets (referred to as "ShokriCollection_DS" in our work) featuring varied road challenges was conducted in [73]. YOLO, particularly YOLOv7, excelled with over 95% detection accuracy and 90% Overall Accuracy (OA) for vehicle classification, despite differing computation times. Metrics were averaged across high-quality datasets from YouTube and Kaggle.

文献[73]对SSD、Faster R-CNN及多个YOLO版本(YOLOv8、YOLOv7、YOLOv6、YOLOv5)在九个包含多样道路挑战的数据集(本研究称为“ShokriCollection_DS”)上进行了更广泛的评估。YOLO,尤其是YOLOv7,在车辆分类中表现出超过95%的检测准确率和90%的总体准确率(OA),尽管计算时间有所不同。指标基于来自YouTube和Kaggle的高质量数据集取平均值。

Another comparison [85] shows that YOLOv6 (73.5 mAP) and YOLOv8 (71.8 mAP) significantly outperform DETR (49.9 mAP) when benchmarking various state-of-the-art object detectors and exploring large computer vision models as image annotators on the RSUD20K dataset for road scene understanding in autonomous driving.

另一项比较[85]显示,在RSUD20K数据集上用于自动驾驶道路场景理解的多种先进目标检测器和大型计算机视觉模型图像标注器的基准测试中,YOLOv6(73.5 mAP)和YOLOv8(71.8 mAP)显著优于DETR(49.9mAP)。

YOLOv1 [44] and YOLOv2 [46] struggled with occlusions due to coarse feature extraction, limiting the detection of overlapping objects. YOLOv3 [47] improved this with multi-scale detection, and YOLOv4 [75] enhanced occlusion handling using feature pyramid networks. Despite these advances, challenges persist. CCW-YOLO [86] improved detection in dense scenes with a lightweight convolutional layer and C2f module, while HCLT-YOLO [87] used a hybrid CNN and transformer to reduce false alarms and missed detections.

YOLOv1 [44]和YOLOv2 [46]因特征提取粗糙,在遮挡情况下表现不佳,限制了对重叠目标的检测。YOLOv3 [47]通过多尺度检测有所改进,YOLOv4 [75]利用特征金字塔网络增强了遮挡处理能力。尽管如此,挑战依然存在。CCW-YOLO [86]通过轻量卷积层和C2f模块提升了密集场景的检测效果,而HCLT-YOLO [87]采用混合CNN与Transformer结构,减少了误报和漏检。

YOLO is fast and efficient for real-time object detection due to its single-shot approach, processing an entire image in one pass. However, this design can struggle with detecting small objects, as the grid-based prediction may lack the precision needed for finer details. YOLO also has difficulty handling overlapping instances, where objects are close together, leading to potential inaccuracies. Additionally, its emphasis on speed may trade off some accuracy, particularly in complex scenes with multiple objects or intricate backgrounds, where precise localization and classification become more challenging.

YOLO因其单次检测(single-shot)方法,能够一次性处理整张图像,因而在实时目标检测中速度快且高效。然而,该设计在检测小目标时存在困难,因为基于网格的预测可能缺乏细节的精确度。YOLO在处理目标重叠(即目标彼此靠近)时也存在挑战,可能导致检测不准确。此外,其对速度的强调可能以牺牲部分准确率为代价,尤其是在多目标或复杂背景的场景中,精确定位和分类更具挑战性。

D. ViT

D. ViT

ViT, introduced in [88], applies transformers to image classification by processing images as patch sequences. Using self-attention, it captures complex dependencies and prioritizes visible features and context, more effectively handling occlusions by inferring hidden objects. Unlike CNNs with localized receptive fields, ViT captures long-range dependencies across the entire image, marking a significant shift in image processing.

ViT(视觉Transformer),由文献[88]提出,将Transformer应用于图像分类,通过将图像视为一系列补丁序列进行处理。利用自注意力机制,ViT捕捉复杂的依赖关系,优先关注可见特征和上下文信息,更有效地处理遮挡,通过推断隐藏目标实现更佳表现。与局部感受野的卷积神经网络(CNN)不同,ViT能够捕获整个图像的长距离依赖,标志着图像处理方式的重大转变。

Figure 6 illustrates the application of ViT to classification in a traffic scene. The initial step in ViT is Image Patching which involves splitting an image into a series of fixed-size patches. These patches are then flattened and linearly embedded. Additionally, position embeddings are added to retain positional information:

图6展示了ViT在交通场景分类中的应用。ViT的第一步是图像分块(Image Patching),即将图像拆分为一系列固定大小的补丁。这些补丁随后被展平并进行线性嵌入。此外,还加入位置嵌入以保留位置信息:

e_0 = [x_class; x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos,    (14)

where e_0 is the initial input embedding to the transformer, x_p^i is the i-th image patch, E is the patch embedding projection, E_pos are the position embeddings, x_class is a learnable embedding that serves as a representation of the entire image, and N is the total number of patches.

其中,e0是Transformer的初始输入嵌入,xpi是第i个图像补丁,E是补丁嵌入投影,Epos 是位置嵌入,xclass 是可学习的嵌入,作为整张图像的表示,N是补丁总数。

The embedded patches then pass through a series of Transformer encoder layers. Each layer comprises two main parts: a multi-headed self-attention mechanism (MHSA) and a position-wise feed-forward network (FFN).

嵌入后的补丁随后通过一系列Transformer编码器层。每层包含两个主要部分:多头自注意力机制(MHSA)和逐位置前馈网络(FFN)。

The MHSA is a self-attention mechanism that allows the model to weigh the importance of different patches when processing each patch:

MHSA是一种自注意力机制,允许模型在处理每个补丁时权衡不同补丁的重要性:

f_Attention(Q, K, V) = f_softmax(Q K^T / √d_k) V,    (15)

where Q, K, and V are the queries, keys, and values, respectively, computed from the input embeddings, and d_k is the dimension of the keys.

其中,Q,K,V分别是从输入嵌入计算得到的查询(queries)、键(keys)和值(values),dk是键的维度。

The attention output is processed through multiple "heads," and the results from these heads are concatenated, represented by the symbol ∥:

注意力输出通过多个“头”处理,这些头的结果被拼接,符号表示为

f_MultiHead(Q, K, V) = (h_1 ∥ h_2 ∥ … ∥ h_n) W^O,    (16)

where the i-th head is computed as:

i个头的计算方式为:

h_i = f_Attention(Q W_i^Q, K W_i^K, V W_i^V),    (17)

and W_i^Q, W_i^K, W_i^V, and W^O are learned parameter matrices, and n is the number of attention heads.

其中WiQ,WiK,WiVWO是学习得到的参数矩阵,n是注意力头的数量。

The position-wise FFN consists of two linear transformations with a ReLU activation in between:

位置逐点前馈神经网络(FFN)由两个线性变换组成,中间夹有ReLU激活函数:

f_FFN(x) = max(0, x W_1 + b_1) W_2 + b_2,    (18)

where x is the input to the FFN, W_1 and W_2 are weight matrices, and b_1 and b_2 are bias vectors.

其中x是FFN的输入,W1W2是权重矩阵,b1b2是偏置向量。

After passing through the Transformer layers, the class token's output (from the final layer) is used to predict the class of the image via a simple linear layer:

经过Transformer层后,使用类别标记(class token)在最终层的输出,通过一个简单的线性层预测图像的类别:

o = z_class^L W_C,    (19)

where z_class^L is the output corresponding to the class token from the last Transformer layer, W_C is the output projection matrix, and o is the output.

其中zclass L是最后一层Transformer中对应类别标记的输出,WC是输出投影矩阵,o是输出结果。
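The following NumPy sketch walks through Equations (14)-(19) on a toy image to show how the pieces fit together: patching, the class token and position embeddings, one encoder layer of multi-head self-attention and an FFN, and the final linear head on the class token. All dimensions, helper names (patchify, mhsa, ffn), and random weights are illustrative assumptions, not the reference ViT implementation of [88].

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patchify(img, p):
    """Split an H x W x 3 image into N flattened p x p patches (input to Eq. 14)."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def mhsa(e, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention (Eqs. 15-17) with the embedding split evenly across heads."""
    d = e.shape[-1]
    dh = d // n_heads
    Q, K, V = e @ Wq, e @ Wk, e @ Wv
    heads = []
    for h in range(n_heads):
        q, k, v = Q[:, h*dh:(h+1)*dh], K[:, h*dh:(h+1)*dh], V[:, h*dh:(h+1)*dh]
        heads.append(softmax(q @ k.T / np.sqrt(dh)) @ v)       # Eq. (15) per head
    return np.concatenate(heads, axis=-1) @ Wo                 # Eqs. (16)-(17)

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2                # Eq. (18)

# Toy setup: 32x32 RGB image, 8x8 patches, embedding size 64, 4 heads, 10 classes.
rng = np.random.default_rng(0)
img, p, d, n_heads, n_cls = rng.random((32, 32, 3)), 8, 64, 4, 10
x = patchify(img, p)
E = rng.normal(size=(x.shape[1], d))
cls_tok = rng.normal(size=(1, d))
E_pos = rng.normal(size=(x.shape[0] + 1, d))
e = np.vstack([cls_tok, x @ E]) + E_pos                        # Eq. (14)
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
e = e + mhsa(e, Wq, Wk, Wv, Wo, n_heads)                       # one encoder layer (residual form)
e = e + ffn(e, W1, b1, W2, b2)
logits = e[0] @ rng.normal(size=(d, n_cls))                    # Eq. (19): head on the class token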

FIGURE 6. Application of ViT in classifying a traffic scene with a crosswalk: The input image is divided into fixed-size patches, which are then flattened and linearly projected into an embedding space. Position embeddings are added to these patch embeddings to retain spatial relationships, along with a class embedding to represent the entire image. The combined embeddings are sequentially processed through the Transformer encoder, involving multiple layers of multi-head self-attention and feed-forward networks. The class token output from the final Transformer layer is passed through an MLP head to predict the class label, in this case, 'crosswalk'

图6. ViT在分类带有斑马线的交通场景中的应用:输入图像被划分为固定大小的图像块,随后被展平并线性投影到嵌入空间。位置嵌入被加到这些图像块嵌入中以保留空间关系,同时加入一个类别嵌入以表示整张图像。组合后的嵌入依次通过Transformer编码器,包含多层多头自注意力和前馈网络。最终Transformer层的类别标记输出通过多层感知机(MLP)头预测类别标签,此处为“斑马线”。

Some applications of ViTs include detecting rain and road surface conditions [89], predicting pedestrian crossing intentions [90], identifying critical traffic moments [91], and detecting unusual traffic scenarios [92].

ViT的一些应用包括检测降雨和路面状况[89]、预测行人过街意图[90]、识别关键交通时刻[91]以及检测异常交通场景[92]。

A cost-effective method for detecting rain and road conditions using ViTs and a Spatial Self-Attention network is presented in [89], achieving F1-scores of 91.13% for rain and 92.10% for road conditions. Adding a sequential detection module improved accuracy to 96.74% and 98.07%, respectively. The study's dataset, referred to as "ViT_DS" in our work, includes 10,000 freeway images from CCTV cameras in Orlando, Florida, labeled for 3 rain levels and 2 road condition levels.

[89]提出了一种结合ViT和空间自注意力网络的经济高效降雨及路况检测方法,降雨和路况的F1分数分别达到91.13%和92.10%。加入序列检测模块后,准确率分别提升至96.74%和98.07%。该研究的数据集(本文称为“ViT_DS”)包含来自佛罗里达奥兰多CCTV摄像头的1万张高速公路图像,标注了3个降雨等级和2个路况等级。

Action-ViT [90] integrates multimodal data (visual cues, poses, bounding boxes, and action annotations) and employs tailored data processing for each modality, enhancing pedestrian crossing intention prediction and achieving a 90.2% F1-score on the JAAD dataset, with ablation studies confirming improvements in temporal modeling and feature fusion.

Action-ViT[90]融合了多模态数据——包括视觉线索、姿态、边界框和动作注释,并针对每种模态采用定制数据处理,提升了行人过街意图预测,在JAAD数据集上取得了90.2%的F1分数,消融实验验证了时序建模和特征融合的改进效果。

ViT-TA [91] is a custom ViT that achieves 94% accuracy in detecting critical moments at a Time-To-Collision (TTC) of 1 s on the Dashcam Accident Dataset (DAD). It classifies critical traffic situations and uses attention maps to highlight probable causes, systematically enhancing automated vehicle safety by generating reliable safety scenarios.

ViT-TA[91]是一种定制ViT,在Dashcam事故数据集(DAD)上实现了94%的准确率,用于检测碰撞时间(TTC) 1秒的关键时刻。它对关键交通情境进行分类,并利用注意力图突出可能原因,系统性地提升自动驾驶车辆的安全性,通过生成可靠的安全场景实现。

ViT-L [92] detects scenario novelty in traffic using infrastructure images and a triplet autoencoder trained on 70,000 traffic scene and graph pairs collected in Germany. Enhanced by expert domain knowledge and ViTs, it applies Angle-Based Outlier Detection (ABOD) in the latent space, achieving a 95.6% AUC. The dataset, referred to as "Wurst_DS" in our work and detailed in [93], comprises highway images for outlier model fitting.

Vit-L[92]利用基础设施图像和在德国7万对交通场景与图形对上训练的三元组自编码器检测交通场景的新颖性。结合专家领域知识和ViT,采用基于角度的异常检测(ABOD)在潜在空间中实现,AUC达到95.6%。该数据集(本文称为“Wurst_DS”,详见[93])包含用于异常模型拟合的高速公路图像。

ViTs excel at capturing long-range dependencies in images, allowing for a more holistic understanding of visual data. However, they require large amounts of data and substantial computational power to achieve high performance, making them less accessible in data-limited scenarios. ViTs can struggle with generalizing from smaller datasets, often leading to overfitting or suboptimal results. Additionally, they may be less efficient than CNNs for lower-resolution images, where the advantage of capturing long-range dependencies is diminished, and the computational overhead becomes more pronounced.

ViT擅长捕捉图像中的长距离依赖关系,从而实现对视觉数据的更全面理解。然而,它们需要大量数据和强大计算资源以达到高性能,在数据有限的场景中不够友好。ViT在小规模数据集上往往难以泛化,容易过拟合或表现不佳。此外,对于低分辨率图像,ViT的效率可能不及卷积神经网络(CNN),因为长距离依赖的优势减弱,而计算开销更为显著。

E. DETR

E. DETR

DETR, introduced in [94], is an innovative model for object detection that leverages the Transformer architecture to streamline the process into an end-to-end framework. By treating object detection as a direct set prediction problem, DETR eliminates the need for hand-designed components like non-maximum suppression and anchor generation. The model employs a combination of a CNN for feature extraction and a Transformer for decoding these features into bounding box predictions and class labels in a single forward pass. Leveraging self-attention, DETR models complex relationships between objects and their context, making it particularly effective at detecting and localizing partially occluded objects by interpreting visible fragments within the overall scene. This unified transformer-based approach marks a significant advancement in handling occlusions and simplifying object detection.

DETR[94]是一种创新的目标检测模型,利用Transformer架构将检测过程简化为端到端框架。通过将目标检测视为直接的集合预测问题,DETR消除了非极大值抑制和锚框生成等手工设计组件。该模型结合了用于特征提取的卷积神经网络(CNN)和用于解码特征为边界框预测及类别标签的Transformer,在一次前向传播中完成。借助自注意力机制,DETR建模了目标与其上下文之间的复杂关系,特别擅长通过解析可见碎片检测和定位部分遮挡的目标。这种基于Transformer的统一方法在处理遮挡和简化目标检测方面具有重要突破。

FIGURE 7. DETR object detection in a traffic scene: The process begins with a CNN extracting image features, which are then enhanced with positional encodings to preserve spatial information and processed through a transformer encoder. The encoder employs several layers of self-attention and FFNs to refine these features for improved detection accuracy. The transformer decoder uses a fixed set of learned object queries, combined with the encoded features, to generate predictions for possible objects, including their classes and bounding boxes. Four FFNs are employed to finalize classifications and bounding box coordinates, effectively highlighting detected objects in the scene.

图7. DETR在交通场景中的目标检测:该过程始于卷积神经网络(CNN)提取图像特征,随后通过位置编码增强以保留空间信息,并通过Transformer编码器处理。编码器采用多层自注意力机制和前馈神经网络(FFN)来优化特征,提高检测精度。Transformer解码器使用一组固定的学习目标查询,结合编码特征,生成可能目标的预测,包括类别和边界框。四个FFN用于最终确定分类和边界框坐标,有效突出场景中的检测目标。

Figure 7 depicts the application of DETR to object detection in a traffic scene. DETR starts with feature extraction: given an input image I, a CNN f_ConvNet is used to extract a feature map F = f_ConvNet(I).

图7展示了DETR在交通场景目标检测中的应用。DETR首先通过特征提取,给定输入图像I,使用fConvNet 提取特征图F,其中F=fConvNet (I),且fConvNet 为卷积神经网络(CNN)。

Positional encodings (PEs) are then added to the feature map F to preserve spatial information, resulting in F' = F + PE, where F' denotes the feature map with positional encoding added.

随后在特征图F上添加位置编码(PEs)以保留空间信息,得到F= F+PE,其中F表示添加了位置编码的特征图。

At the next step, the Transformer encoder processes this enhanced feature map F' through several layers of self-attention and FFNs, resulting in F_z = f_Encoder(F'), where F_z is the encoded feature representation.

下一步,Transformer编码器通过多层自注意力和前馈神经网络处理该增强特征图F,得到Fz=fEncoder (F),其中Fz是编码后的特征表示。

The Transformer decoder uses a set of fixed learned object queries Q and the encoded features F_z to generate predictions:

Transformer解码器使用一组固定的学习目标查询Q和编码特征Fz生成预测:

Q = {q_1, q_2, …, q_{N_q}},    (20)

where Q is the set of N_q learned object queries, and q_i represents the i-th query embedding. The output of the decoder is o = f_Decoder(F_z, Q), where o contains the predictions for potential objects, represented as classes and bounding boxes.

其中Q是一组Nq学习得到的固定目标查询,qi表示第i个查询嵌入。解码器的输出为o=fDecoder (Fz,Q),其中o包含潜在目标的预测,表现为类别和边界框。

The outputs of the Transformer decoder are then processed to yield a fixed-size set of predictions, irrespective of the number of objects in the image. This is represented as:

Transformer解码器的输出随后被处理,生成固定大小的预测集合,与图像中目标数量无关。表示为:

Y = {(ĉ_i, b̂_i) | i = 1, …, N},    (21)

where ĉ_i and b̂_i are the predicted class and bounding box coordinates for the i-th object, and N is the number of predictions.

其中c^ib^i分别是第i个目标的预测类别和边界框坐标,N为预测数量。

The loss function is a crucial part of training DETR, incorporating a unique bipartite matching loss to match predicted and ground truth objects, along with classification and bounding box regression losses.

损失函数是训练DETR的关键部分,包含独特的二分匹配损失,用于匹配预测目标与真实目标,同时包括分类损失和边界框回归损失。

Bipartite matching is used to find the optimal permutation σ̂ of predicted objects that minimizes the matching cost, as described in [94]:

二分匹配用于寻找预测目标σ的最优排列,以最小化匹配成本,如文献[94]所述:

σ̂ = argmin_{σ ∈ S_N} Σ_{i=1}^{N} f_cost(y_i, ŷ_{σ(i)}),    (22)

where S_N is the set of all permutations of N elements, y_i is the ground truth, and ŷ_i is the prediction. Here, f_cost is the cost function that measures the difference between the ground truth object y_i and the predicted object ŷ_{σ(i)}.

其中N是所有N元素排列的集合,yi为真实目标,y^i为预测目标。fcost 是成本函数,用于衡量真实目标yi与预测目标y^σ(i)之间的差异。

The loss function L combines the costs of classification and bounding box prediction:

损失函数L结合了分类和边界框预测的成本:

L = Σ_{i=1}^{N} [λ_cls L_cls(ĉ_{σ̂(i)}, c_i) + λ_bbox L_bbox(b̂_{σ̂(i)}, b_i)],    (23)

where c_i and b_i are the true class and bounding box of the i-th object, respectively, and λ_cls and λ_bbox are weighting factors that determine the relative importance of the classification loss L_cls and the bounding box loss L_bbox. The loss function L is minimized during training to ensure that the predicted outputs closely match the ground truth annotations. This approach helps achieve accurate object detection by focusing on both the correctness of the predicted class and the accuracy of the bounding box localization.

其中cibi分别是真实目标第i个的类别和边界框,λclsλbbox 是权重因子,用以确定分类损失Lcls 和边界框损失Lbbox 的相对重要性。训练过程中最小化损失函数L,确保预测结果与真实标注高度一致。该方法通过同时关注类别正确性和边界框定位精度,实现了准确的目标检测。

During training, DETR minimizes this loss function to learn the parameters that result in the best predictions of object classes and bounding boxes, tailored to match the true objects in the image as closely as possible. This streamlined approach of direct set prediction and loss minimization via bipartite matching distinctly sets DETR apart in the field of object detection.

在训练过程中,DETR通过最小化该损失函数来学习参数,从而实现对目标类别和边界框的最佳预测,尽可能精确地匹配图像中的真实物体。这种通过二分匹配进行直接集合预测和损失最小化的简化方法,使DETR在目标检测领域独树一帜。
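A compact sketch of the set-prediction loss in Equations (22)-(23) is given below, using SciPy's Hungarian solver for the bipartite matching. The specific cost terms (negative class probability plus an L1 box distance) and the weights are simplifying assumptions; the full DETR recipe in [94] also assigns unmatched queries to a "no-object" class, which is omitted here.

import numpy as np
from scipy.optimize import linear_sum_assignment

def detr_loss(pred_probs, pred_boxes, gt_classes, gt_boxes,
              lambda_cls=1.0, lambda_bbox=5.0):
    """pred_probs: (N, C) class scores, pred_boxes: (N, 4); ground-truth arrays are (M, ...)."""
    # Matching cost between every prediction i and every ground-truth object j
    cost_cls = -pred_probs[:, gt_classes]                                      # (N, M)
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # (N, M) L1 distance
    cost = lambda_cls * cost_cls + lambda_bbox * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost)        # Eq. (22): optimal assignment
    # Eq. (23): classification + box regression losses over the matched pairs
    cls_loss = -np.log(pred_probs[pred_idx, gt_classes[gt_idx]] + 1e-9).sum()
    box_loss = np.abs(pred_boxes[pred_idx] - gt_boxes[gt_idx]).sum()
    return lambda_cls * cls_loss + lambda_bbox * box_loss

# Toy example: 5 predictions (object queries), 2 ground-truth objects, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=5)
boxes = rng.random((5, 4))
loss = detr_loss(probs, boxes, gt_classes=np.array([0, 2]), gt_boxes=rng.random((2, 4)))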

DETR's applications are diverse, including enhancements for detecting traffic signs of various sizes [95], recognizing small or weather-affected signs [96], and accelerating model training [97]. Additionally, DETR enhances object detection for autonomous driving by effectively aligning objects with their respective scenes [98].

DETR的应用多样,包括提升对不同尺寸交通标志的检测[95]、识别小型或受天气影响的标志[96]以及加速模型训练[97]。此外,DETR通过有效地将物体与其对应场景对齐,增强了自动驾驶中的目标检测能力[98]。

An innovative approach to traffic sign detection, DSRA-DETR [95], emphasizes enhanced multiscale detection performance through modules that aggregate features across scales, effectively reducing feature noise, preserving low-level features, and boosting the model's ability to recognize objects at various sizes. This results in significant improvements in detection accuracy with impressive APs of 76.13% and 78.24% on GTSDB and CCTSDB datasets, respectively.

一种创新的交通标志检测方法DSRA-DETR[95],通过跨尺度特征聚合模块强调多尺度检测性能的提升,有效减少特征噪声,保留低层特征,增强模型对不同尺寸物体的识别能力。该方法在GTSDB和CCTSDB数据集上分别取得了76.13%和78.24%的显著AP提升。

MTSDet [96] enhances traffic sign detection by using an Attention Mechanism Network (AMNet) and a Path Aggregation Feature Pyramid Network (PAFPN) for multi-scale feature fusion. It excels at detecting small or weather-affected signs, achieving mAP scores of 92.9% on GTSRB and 94.3% on CTSD.

MTSDet[96]通过引入注意力机制网络(AMNet)和路径聚合特征金字塔网络(PAFPN)实现多尺度特征融合,提升了交通标志检测性能。其在检测小型或受天气影响的标志方面表现出色,在GTSRB和CTSD数据集上分别达到了92.9%和94.3%的mAP。

In [97], a Spatially Modulated Co-Attention (SMCA) mechanism improves DETR by focusing co-attention near initial box estimates and integrating multi-head, scale-selection attention. This yields 45.6 mAP in 108 epochs, surpassing DETR's original 43.3 mAP in 500 epochs, as verified by extensive ablation studies on COCO.

在[97]中,空间调制共注意力机制(SMCA)通过聚焦于初始边界框估计附近的共注意力,并整合多头尺度选择注意力,提升了DETR性能。该方法在108个训练周期内实现了45.6mAP,优于DETR原始模型500个周期的43.3mAP,这一结果通过在COCO上的广泛消融实验得到验证。

DetectFormer [98] improves autonomous driving object detection by incorporating a ClassDecoder and a Global Extract Encoder (GEE) to enhance category sensitivity and scene alignment. With data augmentation and attention mechanisms, it achieves AP50 and AP75 scores of 97.6% and 91.4%, respectively, on the BCTSDB dataset.

DetectFormer[98]通过引入类别解码器和全局提取编码器(GEE)提升了自动驾驶目标检测的类别敏感性和场景对齐能力。结合数据增强和注意力机制,该方法在BCTSDB数据集上分别实现了97.6%的AP50和91.4%的AP75。

DETR simplifies object detection by eliminating the need for region proposals and streamlining the process with a transformer-based architecture. However, it requires extensive training data to perform well and is computationally intensive, making it challenging to deploy in resource-constrained environments. DETR can also be slower to converge during training, requiring more epochs to reach optimal performance. Additionally, it struggles with detecting small objects in cluttered scenes, where the lack of region proposals can lead to less precise localization and classification.

DETR通过消除区域提议,利用基于Transformer的架构简化了目标检测流程。然而,它需要大量训练数据以达到良好性能,且计算资源消耗较大,难以在资源受限环境中部署。DETR训练收敛较慢,需要更多训练周期以达到最佳表现。此外,在复杂场景中检测小目标时表现欠佳,缺乏区域提议导致定位和分类精度下降。

F. GRAPH NEURAL NETWORK (GNN)

F. 图神经网络(GNN)

GNNs are a powerful tool for traffic scene understanding, representing road networks as graphs and capturing spatial-temporal relationships. They enable precise analysis of vehicle trajectories, pedestrian movements, and interactions, aiding tasks like congestion prediction, collision avoidance, and adaptive signal control. By leveraging graph-based methods, GNNs enhance real-time decision-making in intelligent transportation systems, contributing to safer and more efficient urban mobility.

图神经网络(GNN)是交通场景理解的强大工具,将道路网络表示为图结构,捕捉时空关系。它们能够精确分析车辆轨迹、行人运动及其交互,辅助拥堵预测、碰撞避免和自适应信号控制等任务。通过利用基于图的方法,GNN提升了智能交通系统的实时决策能力,促进了更安全、更高效的城市出行。

1) GCN

1) 图卷积网络(GCN)

GCN, first introduced in [99], is a neural network designed for graph-structured data, extending the concept of convolution from grid-like data (e.g., images) to graphs. A graph G = (V, E) consists of nodes V (|V| = N) and edges E, represented by an adjacency matrix A, where A_ij = 1 if an edge exists between nodes i and j, and 0 otherwise. Each node v_i ∈ V has a feature vector F_i ∈ R^d, and the node features collectively form a matrix F ∈ R^{N×d}, where d is the number of features per node.

GCN首次在[99]中提出,是一种针对图结构数据设计的神经网络,将卷积的概念从网格状数据(如图像)扩展到图。一个图G=(V,E)由节点V(|V|=N)和边E组成,用邻接矩阵A表示,其中若节点i和节点j之间存在边,则Aij=1,否则为0。每个节点viV具有一个特征向量FiRd,所有节点特征共同构成矩阵FRN×d,其中d表示每个节点的特征数量。

The core idea of a GCN is to perform a convolution-like operation on a graph. The graph convolution for a single layer is expressed as:

GCN的核心思想是在图上执行类似卷积的操作。单层图卷积表达为:

F^(l+1) = σ(Â F^(l) W^(l)),    (24)

where F^(l) ∈ R^{N×F_l} is the input feature matrix at layer l (with F_l features per node), W^(l) ∈ R^{F_l×F_{l+1}} is the weight matrix, σ is a non-linear activation function (e.g., ReLU), and Â is the normalized adjacency matrix with self-loops.

其中 F(l)RN×F(l) 是第 l 层的输入特征矩阵,W(l)RF(l)×F(l+1) 是权重矩阵,σ 是非线性激活函数(例如ReLU),A^ 是带自环的归一化邻接矩阵。

The normalized adjacency matrix A^ is defined as:

归一化邻接矩阵 A^ 定义为:

Â = Δ^{−1/2} Ã Δ^{−1/2},    (25)

where Ã = A + I (the adjacency matrix with self-loops), and Δ is the degree matrix of Ã with diagonal elements Δ_ii = Σ_j Ã_ij.

其中 A~=A+Id(带自环的邻接矩阵),ΔA~ 的度矩阵,其对角元素为 Δii=jA~ij

A typical GCN model has multiple layers. For instance, a two-layer GCN is:

典型的图卷积网络(GCN)模型包含多层。例如,二层GCN为:

O = f_softmax(Â σ(Â X W^(0)) W^(1)),    (26)

where X is the input feature matrix, W^(0) and W^(1) are the weight matrices, σ is the activation function, and O represents the final output, e.g., class probabilities for node classification.

其中 X 是输入特征矩阵,W(0)W(1) 是权重矩阵,σ 是激活函数,O 表示最终输出,例如节点分类的类别概率。

The GCN is trained by minimizing the cross-entropy loss:

GCN通过最小化交叉熵损失进行训练:

L = − Σ_{i ∈ Y_L} Σ_{c=1}^{C} Y_ic log O_ic,    (27)

where Y_L is the set of labeled nodes, Y is the label matrix, and C is the number of classes.

其中 Y 是有标签节点集合,Y 是标签矩阵,C 是类别数。

By iteratively updating W^(l) using gradient descent, the GCN learns features from the graph structure and node attributes for the target task.

通过迭代使用梯度下降更新 W(l),GCN从图结构和节点属性中学习目标任务的特征。
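A minimal NumPy sketch of Equations (24)-(26) on a toy graph is shown below; the adjacency matrix, feature dimensions, and random weights are placeholders chosen only to illustrate the propagation rule, not a tuned traffic model.

import numpy as np

def normalize_adjacency(A):
    """Eq. (25): A_hat = D^{-1/2} (A + I) D^{-1/2}, i.e., normalization with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_two_layer(A, X, W0, W1):
    """Eq. (26): O = softmax(A_hat * ReLU(A_hat X W0) * W1), one row of class scores per node."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(0, A_hat @ X @ W0)      # first graph convolution, Eq. (24) with ReLU
    return softmax(A_hat @ H @ W1)         # second convolution + node-wise softmax

# Toy graph: 4 nodes, 3 input features per node, 2 output classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
O = gcn_two_layer(A, rng.random((4, 3)), rng.normal(size=(3, 8)), rng.normal(size=(8, 2)))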

GCNs excel in traffic scene understanding by modeling complex relationships in graph-structured data. Applications include vehicle behavior classification across datasets [100], recognizing dynamic traffic police gestures [101], interpreting these gestures in real-time [102], understanding police intentions from visual cues [103], and recognizing actions of traffic participants in advanced driver-assistance systems [104].

GCN通过对图结构数据中复杂关系的建模,在交通场景理解中表现出色。应用包括跨数据集的车辆行为分类[100]、识别动态交通警察手势[101]、实时解读这些手势[102]、从视觉线索理解警察意图[103],以及在高级驾驶辅助系统中识别交通参与者动作[104]。

The MR-GCN architecture for vehicle behavior classification [100] achieves sensor invariance and high accuracy: 99% on Apollo, 89% on KITTI, and 84% on Indian datasets. Combining spatial scene graphs and LSTM layers, it encodes spatial-temporal dynamics and outperforms baselines, demonstrating robustness across diverse datasets, even with fewer landmarks.

用于车辆行为分类的MR-GCN架构[100]实现了传感器不变性和高准确率:Apollo数据集99%,KITTI数据集89%,印度数据集84%。该方法结合空间场景图和LSTM层,编码时空动态,优于基线方法,展示了在多样数据集上的鲁棒性,即使地标较少。

In [101], a gesture recognition method focuses on dynamic traffic police gestures using a spatial-temporal GCN (ST-GCN) with attention mechanisms and adaptive graph structures. It achieves 87.72% accuracy on the Chinese Traffic Police Gestures (CTPG) dataset, outperforming existing action-recognition methods.

[101]中提出的手势识别方法聚焦于动态交通警察手势,采用带注意力机制和自适应图结构的时空图卷积网络(ST-GCN)。在中国交通警察手势(CTPG)数据集上达到87.72%的准确率,优于现有动作识别方法。

Pose GCN [102] presents an online activity recognition method employing pose estimation and GCNs to interpret traffic police gestures in real-time frames. It achieves a response time of 716 ms and an accuracy rate of 97.52% on the TPGR dataset.

Pose GCN[102]提出了一种在线活动识别方法,结合姿态估计和GCN实时解读交通警察手势。在TPGR数据集上实现了 716ms 的响应时间和 97.52% 的准确率。

In [103], a system for recognizing traffic police intentions from visual cues achieves 87.72% OA on the TPGR dataset. The approach uses OpenPose [105] to extract key points, which are transformed into spatiotemporal maps processed by GCNs and modified transformers.

[103]中提出的基于视觉线索识别交通警察意图的系统,在TPGR数据集上达到 87.72% 的整体准确率(OA)。该方法使用OpenPose[105]提取关键点,转换为空间时间图,由GCN和改进的Transformer处理。

FIGURE 8. GAT-based license plate detection: An image of a car’s rear undergoes convolution to extract essential features, which are then refined by a GAT layer using an attention mechanism to determine the importance of neighboring features. The GAT operates on a graph representation, where each node is associated with a feature vector, and computes attention weights between nodes to aggregate information effectively. The saliency map is produced by fusing these attention-weighted features, guiding an RPN to accurately localize and identify the license plate. This sophisticated setup, combined with attention mechanisms to compute dynamic weights, enhances detection precision, ensuring reliable and accurate identification of license plates under various conditions. The integration of multiple attention heads helps capture different aspects of neighboring relationships, contributing to robustness in feature refinement.

图8. 基于图注意力网络(GAT)的车牌检测:车辆后部图像经过卷积提取关键特征,随后通过GAT层利用注意力机制确定邻近特征的重要性。GAT在图结构上操作,每个节点关联一个特征向量,计算节点间的注意力权重以有效聚合信息。显著性图通过融合这些加权特征生成,引导区域建议网络(RPN)准确定位和识别车牌。该复杂结构结合动态权重计算的注意力机制,提高了检测精度,确保在各种条件下车牌的可靠准确识别。多头注意力机制帮助捕捉邻居关系的不同方面,增强了特征细化的鲁棒性。

The framework in [104] employs 3D human pose estimation and a dynamic adaptive GCN to recognize actions of traffic police, cyclists, and pedestrians. By optimizing object detection and pose estimation modules, it processes multiple objects simultaneously in real traffic scenarios, achieving 80% accuracy on the 3D-HPT dataset.

[104]中的框架采用三维人体姿态估计和动态自适应GCN识别交通警察、骑行者和行人的动作。通过优化目标检测和姿态估计模块,能够在真实交通场景中同时处理多个对象,在3D-HPT数据集上达到 80% 的准确率。

GCNs effectively model complex relationships in non-Euclidean data like graphs but face challenges with scalability due to high computational and memory demands on large graphs. They are prone to over-smoothing, where node features lose distinction after multiple layers, and require careful design to capture long-range dependencies, as standard architectures may not naturally handle distant node relationships.

GCN(图卷积网络)有效地建模了非欧几里得数据如图中的复杂关系,但由于大规模图的高计算和内存需求,面临可扩展性挑战。它们容易出现过度平滑现象,即节点特征在多层传播后失去区分性,并且需要精心设计以捕捉长距离依赖,因为标准架构可能无法自然处理远距离节点关系。

2) GAT

2) GAT

GAT, introduced in [106], is a neural network architecture for graph-structured data that incorporates attention mechanisms. Using masked self-attentional layers, GATs overcome limitations of graph convolutions by allowing nodes to assign varying weights to their neighbors' features. This avoids computationally expensive matrix operations such as inversion and does not require a priori knowledge of the graph structure.

GAT(图注意力网络),在文献[106]中提出,代表了针对图结构数据的神经网络架构,融合了注意力机制。通过使用掩码自注意力层,GAT克服了图卷积的局限,使节点能够为邻居特征分配不同权重。这避免了计算代价高昂的矩阵运算如求逆,也不需要事先知道图的结构。

Figure 8 illustrates the GAT-based license plate detection process, where the attention mechanism refines features for accurate license plate identification in traffic scenes. For a graph G = (V, E), V represents the set of nodes and E represents the set of edges. Each node v_i ∈ V is associated with a feature vector F_i ∈ R^d, where d is the feature dimension.

图8展示了基于GAT的车牌检测过程,其中注意力机制优化特征以实现交通场景中车牌的准确识别。对于图,G=(V,E),V表示节点集合,E表示边集合。每个节点viV关联一个特征向量FiRd,其中d是特征维度。

The key idea behind GAT is to compute an attention weight for each neighbor of a node and use these weights to aggregate information from the neighbors. GAT employs a self-attention mechanism to calculate the attention weight of each neighboring node v_j with respect to node v_i, formulated as:

GAT的核心思想是为节点的每个邻居计算注意力权重,并利用这些权重聚合邻居信息。GAT采用自注意力机制计算相对于节点vi的每个邻居节点vj的注意力权重,公式为:

α_ij = f_LeakyReLU(a^T [W F_i ∥ W F_j]),    (28)

where α_ij is the attention score, W is a learnable weight matrix, ∥ denotes concatenation, a is a shared learnable weight vector, and f_LeakyReLU is a leaky ReLU activation function. Normalized attention weights are obtained using a softmax function:

其中αij是注意力分数,W是可学习的权重矩阵,表示拼接操作,a是共享的可学习权重向量,fLeakyReLU 是带泄漏的ReLU激活函数。归一化的注意力权重通过softmax函数获得:

β_ij = exp(α_ij) / Σ_{k ∈ N_i} exp(α_ik),    (29)

where N_i is the set of neighbors of v_i, and β_ij is the normalized attention weight for neighbor v_j. GAT aggregates information from neighbors using the attention weights:

其中Ni是节点vi的邻居集合,βij是邻居节点vj的归一化注意力权重。GAT利用注意力权重聚合邻居信息:

F'_i = σ(Σ_{j ∈ N_i} β_ij W F_j),    (30)

where σ is an activation function (e.g., ReLU), and F'_i is the updated feature vector for node v_i.

其中σ是激活函数(例如ReLU),Fi是节点vi的更新特征向量。

To capture diverse neighborhood relationships, GAT uses multiple attention heads. Outputs from all heads are concatenated and passed through a learnable weight matrix:

为了捕捉多样的邻域关系,GAT使用多个注意力头。所有头的输出被拼接后通过可学习的权重矩阵:

F'_i = (F_i^(1) ∥ F_i^(2) ∥ … ∥ F_i^(K)) W,    (31)

where F_i^(k) is the output of the k-th attention head, K is the number of heads, and ∥ denotes concatenation.

其中Fi(k)是第k个注意力头的输出,K是头的数量,||表示拼接操作。
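The single-head sketch below illustrates Equations (28)-(30) in NumPy; the LeakyReLU slope, the toy graph, and the weight shapes are assumptions for illustration. Multi-head aggregation as in Equation (31) would simply run several such layers with independent parameters and concatenate their outputs.

import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(A, F, W, a):
    """A: (N, N) adjacency, F: (N, d) features, W: (d, d') weights, a: (2*d',) attention vector."""
    H = F @ W                                            # transformed features W F_i
    N = A.shape[0]
    F_out = np.zeros_like(H)
    for i in range(N):
        neighbors = np.flatnonzero(A[i])                 # N_i
        # Eq. (28): attention logits over concatenated pairs [W F_i || W F_j]
        logits = np.array([leaky_relu(a @ np.concatenate([H[i], H[j]])) for j in neighbors])
        beta = np.exp(logits) / np.exp(logits).sum()     # Eq. (29): softmax over the neighborhood
        F_out[i] = np.maximum(0, (beta[:, None] * H[neighbors]).sum(axis=0))   # Eq. (30), ReLU
    return F_out

# Toy graph with self-loops so that every node has at least one neighbor.
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
F_new = gat_layer(A, rng.random((3, 4)), rng.normal(size=(4, 5)), rng.normal(size=10))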

GATs allocate attention within scene graphs, enabling precise object detection and tracking. They are applied to detect license plate numbers in urban environments [107], track pedestrians for traffic safety [108], enhance semantic segmentation, anomaly detection, traffic flow analysis, and improve classification accuracy in complex traffic scenes [109].

GAT在场景图中分配注意力,实现精确的目标检测和跟踪。它们被应用于城市环境中的车牌号码检测[107]、行人跟踪以保障交通安全[108]、提升语义分割、异常检测、交通流量分析,并提高复杂交通场景中的分类准确率[109]。

FIGURE 9. GIN procedure for traffic scene matching: The process begins with “Dataset Preprocessing,” where traffic datasets are converted into road scene graphs via a graph prediction module. This step involves cleaning, filtering, and transforming raw traffic data into a structured format suitable for graph construction. Concurrently, "Query Preprocessing" processes a traffic scene query through "Actor" and "Map Components" clusters, forming a scene graph of the specific road described in the input query. This involves identifying and classifying key elements of the traffic scene, such as vehicles, pedestrians, and road features. These preprocessed graphs are then used to construct the “Input Graph” (Graphx) and the “Isomorphic Matching Subgraph” (Graphy). Graph χ represents the overall dataset graph, while Graphy is a subgraph extracted to match the query. The procedure culminates in the “Results,” where matched road scenes are displayed. This final step leverages the GIN to ensure precise matching and accurate representation of road scenes from diverse datasets, providing practitioners with reliable scene matches for analysis and decision-making.

图9. GIN(图同构网络)用于交通场景匹配的流程:流程始于“数据集预处理”,通过图预测模块将交通数据集转换为道路场景图。此步骤包括清洗、过滤和转换原始交通数据,使其适合构建图结构。与此同时,“查询预处理”通过“参与者”和“地图组件”聚类处理交通场景查询,形成输入查询所描述道路的场景图。该过程涉及识别和分类交通场景中的关键元素,如车辆、行人和道路特征。预处理后的图用于构建“输入图”(Graphx)和“同构匹配子图”(Graphy)。图χ代表整体数据集图,Graphy是为匹配查询提取的子图。流程最终在“结果”阶段展示匹配的道路场景。该步骤利用GIN确保精确匹配和准确表示来自不同数据集的道路场景,为实践者提供可靠的场景匹配以供分析和决策。

APSEGAT [107] efficiently detects license plate numbers in crowded urban environments with diverse vehicles and complex scenes. It achieves a superior F-Score of 90% compared to YOLO’s 86% on the AMLPR dataset.

APSEGAT[107]高效检测拥挤城市环境中多样车辆和复杂场景下的车牌号码。在AMLPR数据集上,其F-Score达到90%,优于YOLO的86%。

The GAM tracker [108] employs sparse candidate selection, graph attention maps, and distance matching loss for pedestrian tracking, achieving 94.99% MOTA on the Pets-mf dataset. It addresses pedestrian safety challenges and supports traffic statistics and abnormal behavior analysis in intelligent transportation systems.

GAM跟踪器[108]采用稀疏候选选择、图注意力图和距离匹配损失进行行人跟踪,在Pets-mf数据集上实现了94.99%的MOTA。它解决了行人安全问题,并支持智能交通系统中的交通统计和异常行为分析。

SCENE [109] leverages heterogeneous GNNs and graph convolutions to encode diverse traffic scenarios, achieving 91.17% accuracy in binary node classification tasks on a custom large-scale dataset ("GAT_SCENE") with 22,400 sequences, each containing 3 seconds of temporal history. Performance and transferability are notably enhanced by incorporating edge features into the GAT operator.

SCENE[109]利用异构图神经网络(GNN)和图卷积编码多样化交通场景,在包含22400个序列(每个序列包含3秒时间历史)的定制大规模数据集“GAT_SCENE”上,二元节点分类任务准确率达到91.17%。通过将边特征引入图注意力网络(GAT)算子,显著提升了性能和迁移能力。

GATs enhance GCNs by assigning varying importance to neighboring nodes via an attention mechanism for more nuanced relationship modeling. However, this mechanism is computationally expensive on large graphs, prone to overfitting when focusing on a few nodes, and struggles to capture long-range dependencies due to its local application.

图注意力网络(GAT)通过注意力机制为邻居节点分配不同权重,增强了图卷积网络(GCN)对关系的细致建模。然而,该机制在大规模图上计算开销大,聚焦少数节点时易过拟合,且由于其局部应用,难以捕捉长距离依赖。

3) GIN

3) GIN

GIN, introduced in [110], is a GNN designed to process graph-structured data by effectively capturing intricate graph topology. It enhances node representations by considering both a node's features and its neighbors' contributions, enabling precise structural differentiation and identifying subtle graph differences.

GIN由[110]提出,是一种设计用于处理图结构数据的图神经网络,能够有效捕捉复杂的图拓扑结构。它通过同时考虑节点自身特征及其邻居的贡献,增强节点表示,实现精确的结构区分和细微图差异识别。

Figure 9 illustrates the GIN procedure for matching traffic scenes. Let F_v^(0) represent the initial feature vector of node v. GIN begins by aggregating information from neighboring nodes and combining it with the node's own features. The aggregation employs an MLP, a fully connected neural network with multiple layers. The updated node representations are computed using the function f_MLP as follows:

图9展示了GIN匹配交通场景的过程。设Fv(0)为节点v的初始特征向量。GIN首先聚合邻居节点信息,并与节点自身特征结合。聚合过程采用多层感知机(MLP),即具有多层的全连接神经网络。更新后的节点表示通过函数fMLP计算如下:

F_v^(l) = f_MLP((1 + ϵ^(l)) F_v^(l−1) + Σ_{u ∈ N(v)} F_u^(l−1)),    (32)

where N(v) represents the neighbors of node v, ϵ^(l) is a learnable parameter, and F_v^(l) is the representation of node v at layer l after being processed by the function f_MLP.

其中,N(v)表示节点v,ϵ(l)的邻居,Fv(l)是可学习参数,v节点在第l层经过函数fMLP处理后的表示。

In some cases, a readout function aggregates node-level information to obtain graph-level embeddings. For example, the graph-level embedding F_graph could be computed as the sum of all node embeddings:

在某些情况下,读出函数将节点级信息聚合以获得图级嵌入。例如,图级嵌入Fgraph 可计算为所有节点嵌入的和:

F_graph = Σ_v F_v^(L),    (33)

where L is the number of GIN layers.

其中,L为GIN层数。

GIN iteratively updates node representations using their own features and those of neighboring nodes. A learnable parameter ϵ^(l) distinguishes between a node's own features and its neighbors', enabling GIN to capture complex graph structures. This is particularly effective for graph classification tasks where topology is crucial.

GIN通过迭代更新节点表示,结合节点自身及邻居特征。可学习参数ϵ(l)区分节点自身特征与邻居特征,使GIN能够捕捉复杂图结构。这在拓扑结构关键的图分类任务中尤为有效。
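The sketch below implements one GIN layer and a sum readout, i.e., Equations (32)-(33), in NumPy; the MLP width, the value of ϵ, and the toy graph are illustrative assumptions rather than settings from any cited work.

import numpy as np

def mlp(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def gin_layer(A, F, eps, W1, b1, W2, b2):
    """Eq. (32): F_v <- MLP((1 + eps) * F_v + sum of neighbor features)."""
    neighbor_sum = A @ F                      # row v holds the sum over N(v)
    return mlp((1.0 + eps) * F + neighbor_sum, W1, b1, W2, b2)

def readout(F):
    """Eq. (33): graph-level embedding as the sum of node embeddings."""
    return F.sum(axis=0)

# Toy graph: 4 nodes, 3 input features, hidden width 8.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
F = rng.random((4, 3))
F = gin_layer(A, F, eps=0.1,
              W1=rng.normal(size=(3, 8)), b1=np.zeros(8),
              W2=rng.normal(size=(8, 8)), b2=np.zeros(8))
graph_embedding = readout(F)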

GINs are widely applied in traffic scene understanding, including road scene-graph embedding [111], vehicle and pedestrian path prediction [112], traffic scene retrieval [113], automatic scenario detection [114], and real-time pedestrian path prediction [115].

GIN广泛应用于交通场景理解,包括道路场景图嵌入[111]、车辆与行人路径预测[112]、交通场景检索[113]、自动场景检测[114]及实时行人路径预测[115]。

Roadscene2vec [111] uses GCNs, GINs, and CNNs to enhance road scene-graph analysis for spatial modeling, graph learning, and risk assessment. For collision prediction, it achieves 88.12% (GCN), 80.28% (GIN), and 70.39% (ResNet-50) on 271-syn, and 90.95%, 78.03%, and 80.80%, respectively, on 1043-syn. For subjective risk assessment, evaluating perceived driver risk, it achieves 93.20% (GCN), 85.61% (GIN), and 69.38% (ResNet-50) on 271-syn, and 95.80%, 87.84%, and 90.53%, respectively, on 1043-syn.

Roadscene2vec[111]结合GCN、GIN和卷积神经网络(CNN)提升道路场景图分析,用于空间建模、图学习和风险评估。碰撞预测在271-syn数据集上分别达到88.12%(GCN)、80.28%(GIN)和70.39%(ResNet-50),在1043-syn上分别为90.95%、78.03%和80.80%。主观风险评估(评估驾驶员感知风险)在271-syn上分别为93.20%(GCN)、85.61%(GIN)和69.38%(ResNet-50),在1043-syn上分别为95.80%,87.84%90.53%

Pishgu [112] introduces a lightweight network combining GINs with attention mechanisms for path prediction. It improves ADE/FDE by up to 42%/61% for vehicles (bird’s-eye view) and 23%/22% for pedestrians (high-angle view). Tested on the ActEV/VIRAT dataset, it achieves ADE and FDE scores of 14.11 and 27.96, making it a key resource for CyberPhysical Systems (CPS) applications.

Pishgu[112]提出结合GIN与注意力机制的轻量级网络用于路径预测。在ActEV/VIRAT数据集上,车辆(鸟瞰视角)和行人(高角度视角)的平均位移误差(ADE)和最终位移误差(FDE)分别提升了42%/61%23%/22%。其ADE和FDE分别为14.11和27.96,是网络物理系统(CPS)应用的重要资源。

RSG-Search [113] is a graph-based traffic scene retrieval system using sub-graph isomorphic searching for actor configurations and semantic relationships. It ensures dataset compatibility (e.g., nuScenes, NEDO), achieving full accuracy with low matching times (0-2 seconds). The RSG dataset includes 500 traffic scenes, 200,000 topological graphs, 6 node types (e.g., vehicle, pedestrian), and 25 relationship categories (e.g., 'passing-by', 'waiting-for').

RSG-Search [113] 是一个基于图的交通场景检索系统,利用子图同构搜索来识别参与者配置和语义关系。它确保数据集兼容性(如 nuScenes、NEDO),实现了全准确率且匹配时间低(0-2秒)。RSG 数据集包含500个交通场景、20万个拓扑图、6种节点类型(如车辆、行人)和25种关系类别(如“经过”、“等待”)。

The study in [114] presents expert-knowledge-aided representation learning for traffic scenarios using GIN and an automatic mining strategy. It enables effective clustering and novel scenario detection without manual labeling, achieving an AUC of 99.1% on a dataset simulated with OpenStreetMap.

文献[114]提出了一种结合专家知识辅助的交通场景表示学习方法,采用图同构网络(GIN)和自动挖掘策略。该方法无需人工标注,实现了有效的聚类和新颖场景检测,在基于OpenStreetMap模拟的数据集上达到了99.1%的AUC值。

CARPe [115] introduces a real-time pedestrian path prediction approach by combining GINs with an agile convolutional NN design. It achieves impressive results with an ADE of 0.80 and FDE of 1.48 on the ETH dataset, significantly improving speed and accuracy for applications such as autonomous vehicles and environmental monitoring.

CARPe [115] 引入了一种实时行人路径预测方法,将GIN与灵活的卷积神经网络设计相结合。在ETH数据集上取得了0.80的平均位移误差(ADE)和1.48的最终位移误差(FDE),显著提升了自动驾驶和环境监测等应用的速度和准确性。

GINs excel at distinguishing graph structures by capturing subtle differences between nodes and edges, making them effective for graph classification. However, they are prone to overfitting with limited data, requiring careful hyperparameter tuning. GINs also face scalability and efficiency challenges on large or complex graphs due to their depth and computational demands.

GIN在捕捉节点和边之间细微差异以区分图结构方面表现出色,使其在图分类任务中非常有效。然而,GIN在数据有限时易过拟合,需谨慎调整超参数。由于其深度和计算需求,GIN在处理大型或复杂图时面临可扩展性和效率挑战。

G. CapsNet

G. CapsNet

CapsNet, introduced in [116], addresses CNN limitations by effectively handling spatial hierarchies between simple and complex objects. It encapsulates feature information (e.g., pose, texture, deformation) into neuron groups called "capsules," which use dynamic routing for enhanced feature representation and recognition. A capsule's activity vector represents the instantiation parameters of an entity (e.g., object or part), with its length indicating the probability of the entity's presence in the input.

CapsNet由文献[116]提出,旨在解决卷积神经网络(CNN)在处理简单与复杂对象空间层次关系上的局限。它将特征信息(如姿态、纹理、变形)封装到称为“胶囊”的神经元组中,利用动态路由机制增强特征表示和识别能力。胶囊的活动向量表示实体(如对象或部件)的实例化参数,其长度表示该实体在输入中存在的概率。

A squashing function is used to ensure that a capsule's output vector has a small length when the probability of the entity being present is low and a length close to 1 when it is high. It is typically defined as:

为了确保当实体存在概率低时胶囊输出向量长度较短,概率高时长度较长,采用了挤压函数。其定义通常为:

v_j = (‖s_j‖² / (1 + ‖s_j‖²)) (s_j / ‖s_j‖),    (34)

where s_j is the total input to the j-th capsule, and ‖·‖ denotes the length (norm) of a vector. This function ensures that the length of v_j is between 0 and 1, thus representing a probability.

其中sj是输入到第j个胶囊的总和, ||1表示向量的L1范数或长度。该函数保证vj的长度介于0和1之间,从而表示概率。

Dynamic routing allows a capsule to send its output to parent capsules in the next layer based on prediction agreement. Each lower-layer capsule predicts the output of higher-layer capsules using a transformation matrix W_ij, expressed as û_{j|i} = W_ij u_i, where u_i is the output of the i-th lower-layer capsule and û_{j|i} is the prediction for the j-th higher-layer capsule.

动态路由允许胶囊根据预测一致性将输出发送给下一层的父胶囊。每个低层胶囊使用变换矩阵Wij预测高层胶囊的输出,表达为u^ji=Wijui,其中ui是第i个低层胶囊的输出,u^ji是对第j个高层胶囊的预测。

Capsules send outputs to parent capsules based on "routing by agreement," measured by the scalar product between the prediction vector and the parent's output vector. The coupling coefficients c_ij are then updated based on the agreement and are defined by a softmax function over the initial logits b_ij:

胶囊基于“路由一致性”将输出发送给父胶囊,该一致性通过预测向量与父胶囊输出向量的点积衡量。耦合系数cij随后根据一致性更新,定义为对初始logitsbij的softmax函数:

c_ij = exp(b_ij) / Σ_k exp(b_ik).    (35)

The total input s_j to the j-th capsule is calculated as a weighted sum of the predicted outputs:

j个胶囊的总输入sj计算如下,是预测输出的加权和:

s_j = Σ_i c_ij û_{j|i}    (36)

Algorithm 1 shows the dynamic routing procedure used to train the coupling coefficients c_ij between each i-th primary capsule and j-th output capsule [116].

算法1展示了训练每个第i个初级胶囊与第j个输出胶囊之间耦合系数cij的动态路由过程[116]。

FIGURE 10. Traffic scene classification with CapsNet: The input image is first processed through a ReLU convolution layer, followed by 32 primary capsules ('PrimaryCaps'). Capsules, the fundamental units of CapsNets, encapsulate feature information like pose and texture, with the length of each capsule's output vector representing the entity's presence probability. These primary capsules are linked to 8 traffic capsules ('TrafficCaps') via weight matrices W_ij, and dynamic routing is applied to selectively send outputs based on agreement. The length of each TrafficCaps vector, constrained by a squashing function to lie between 0 and 1, influences the classification loss. L2 normalization further enhances categorization, ensuring robust classification of traffic elements like police cars and pedestrians.

图10. 使用CapsNet进行交通场景分类:输入图像首先经过ReLU卷积层处理,随后进入32个初级胶囊(‘PrimaryCaps’)。胶囊作为CapsNet的基本单元,封装了姿态和纹理等特征信息,每个胶囊输出向量的长度表示实体存在的概率。这些初级胶囊通过权重矩阵Wii连接到8个交通胶囊(‘TrafficCaps’),并应用动态路由根据一致性选择性发送输出。每个TrafficCaps向量的长度通过挤压函数限制在0到1之间,影响分类损失。||L2||归一化进一步增强了分类效果,确保对警车和行人等交通元素的稳健识别。

Algorithm 1 Routing Algorithm to Train the Coupling Coefficients Between Each Primary and Output Capsule

算法1 训练每个初级胶囊与输出胶囊之间耦合系数的路由算法


procedure ROUTING(û_{j|i}, r, l)
    for all capsule i in layer l and capsule j in layer (l + 1): b_ij ← 0
    for r iterations do
        for all capsule i in layer l: c_i ← softmax(b_i)    {softmax computes Equation 35}
        for all capsule j in layer (l + 1): s_j ← Σ_i c_ij · û_{j|i}
        for all capsule j in layer (l + 1): v_j ← squash(s_j)    {squash computes Equation 34}
        for all capsule i in layer l and capsule j in layer (l + 1): b_ij ← b_ij + û_{j|i} · v_j
    return v_j


These equations and the architecture help preserve detailed spatial information and enable the network to better understand the relationships and hierarchies between different parts of the objects.

这些方程和架构有助于保留详细的空间信息,使网络能够更好地理解物体不同部分之间的关系和层次结构。
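To make the squashing function and Algorithm 1 concrete, the NumPy sketch below implements Equation (34) and the routing loop of Equations (35)-(36); the capsule counts, dimensions, and random transformation matrices are toy assumptions, not the configuration of [116].

import numpy as np

def squash(s):
    """Eq. (34): shrink the vector length into [0, 1) while preserving its direction."""
    norm_sq = (s ** 2).sum(axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat, r=3):
    """u_hat: (num_lower, num_upper, dim) predictions u_hat_{j|i}; returns upper-capsule outputs v_j."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                       # routing logits b_ij
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # Eq. (35): softmax over upper capsules
        s = (c[:, :, None] * u_hat).sum(axis=0)                # Eq. (36): weighted sum per upper capsule
        v = squash(s)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)           # agreement: u_hat_{j|i} . v_j
    return v

# Toy example: 32 primary capsules predicting 8 output capsules of dimension 16.
rng = np.random.default_rng(0)
u_i = rng.random((32, 8))                                  # primary capsule outputs
W = rng.normal(size=(32, 8, 8, 16)) * 0.1                  # transformation matrices W_ij
u_hat = np.einsum('id,ijdk->ijk', u_i, W)                  # u_hat_{j|i} = W_ij u_i
v = dynamic_routing(u_hat)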

Figure 10 illustrates the CapsNet procedure for traffic scene classification. The third layer, Traffic Capsules (TrafficCaps), contains 8 capsules, each a 16-dimensional vector, fully connected to the previous layer's capsules. Dynamic routing ensures layer communication, while a squashing function bounds output vector lengths between 0 and 1, indicating entity presence probabilities. Weight matrices W_ij classify features into 8 traffic object classes.

图10展示了用于交通场景分类的胶囊网络(CapsNet)过程。第三层“交通胶囊”(Traffic Capsules,TrafficCaps)包含8个胶囊,每个为16维向量,完全连接到前一层的胶囊。动态路由确保层间通信,挤压函数(squashing function)将输出向量长度限制在0到1之间,表示实体存在的概率。权重矩阵 Wij 用于将特征分类为8类交通对象。

CapsNets outperform traditional CNNs in robustness and generalization, excelling at capturing spatial relationships and complex interactions in traffic scenes, such as congested intersections. Applications include traffic sign detection [117], highway scene segmentation [118], and complex scenario recognition [119], offering deeper and more reliable insights into dynamic road environments.

胶囊网络(CapsNets)在鲁棒性和泛化能力上优于传统卷积神经网络(CNN),擅长捕捉交通场景中的空间关系和复杂交互,如拥堵路口。应用包括交通标志检测[117]、高速公路场景分割[118]和复杂场景识别[119],为动态道路环境提供更深入、更可靠的洞察。

TSDCaps [117] addresses CNN limitations for traffic sign detection, achieving 97.6% accuracy on the GTSRB dataset. It improves feature extraction, enhances reliability, and demonstrates resistance to adversarial attacks, making it well-suited for autonomous vehicle applications.

TSDCaps [117] 针对交通标志检测中CNN的局限性, 在GTSRB数据集上实现了 97.6% 的准确率。它提升了特征提取能力,增强了可靠性,并表现出对对抗攻击的抵抗力,非常适合自动驾驶车辆应用。

A scene segmentation model in [118], trained on Auckland Highway Images (AHI), achieves an IoU of 74.61%. It enhances scene comprehension using matrix representations for pose and spatial relationships, reduces manual data manipulation, and addresses the challenging Picasso problem.

[118] 中的场景分割模型在奥克兰高速公路图像(Auckland Highway Images,AHI)上训练,达到74.61%的准确率。该模型利用矩阵表示姿态和空间关系,提升了场景理解,减少了手动数据处理,并解决了具有挑战性的毕加索问题。

The authors of [119] proposed "ImprovedCaps," a two-step approach for complex scenes. It enhances traffic sign features through image processing before applying CapsNet for recognition, improving GTSRB accuracy by 2%-5% in complex scenarios and achieving 96% overall accuracy.

[119] 的作者提出了“ImprovedCaps”,一种针对复杂场景的两步方法。通过图像处理增强交通标志特征后应用胶囊网络进行识别,在复杂场景中将GTSRB准确率提升了2%-5%,整体准确率达到96%。

In [120], "LiuCaps" is introduced for traffic-light sign recognition in autonomous vehicles. Trained on the TL_Dataset, it achieves 98.72% accuracy and a 99.27% F1-score, outperforming traditional CNNs while reducing training data needs and improving spatial relationship handling.

在[120]中,"LiuCaps"被引入用于自动驾驶车辆中的交通信号灯识别。该模型在TL_Dataset上训练,达到98.72%的准确率和99.27%的F1分数,优于传统卷积神经网络(CNN),同时减少了训练数据需求并提升了空间关系处理能力。

CapsNets excel at capturing spatial hierarchies and pose relationships, providing detailed object structure understanding compared to CNNs. However, they are computationally intensive, memory-demanding, and challenging to train, requiring complex optimization. Their limited scalability and efficiency on large datasets hinder adoption in resource-critical, large-scale applications.

胶囊网络(CapsNets)擅长捕捉空间层次结构和姿态关系,相较于卷积神经网络(CNN)提供了更详尽的物体结构理解。然而,它们计算量大、内存需求高且训练复杂,需复杂的优化方法。其在大规模数据集上的可扩展性和效率有限,阻碍了在资源受限的大规模应用中的推广。

H. HPO FOR DISCRIMINATIVE DL ARCHITECTURES

用于判别式深度学习架构的超参数优化(H.HPO)

HPO optimizes model architecture and parameters like learning rate, network structure, and regularization, enhancing accuracy, generalization, and efficiency across datasets. Fine-tuning learning rates, anchor box dimensions (R-CNN, YOLO), and architectures (DETR, ViT) improves performance and pixel-level accuracy. In graph learning, tuning GCN layers and node features enhances spatial and structural relationship capture, improving predictions. Techniques like dropout, early stopping, and batch size optimization combat overfitting and computational constraints, while task-specific tuning strengthens multitask learning and temporal modeling for traffic scene understanding.

超参数优化(HPO)通过调整学习率、网络结构和正则化等模型架构和参数,提升模型在各数据集上的准确性、泛化能力和效率。微调学习率、锚框尺寸(R-CNN、YOLO)和架构(DETR、ViT)可提升性能和像素级准确度。在图学习中,调整图卷积网络(GCN)层数和节点特征增强空间及结构关系的捕捉能力,改善预测效果。采用丢弃法(dropout)、早停(early stopping)和批量大小优化等技术可防止过拟合和计算瓶颈,而任务特定的调优则强化多任务学习和交通场景的时序建模。

The learning rate controls weight updates: higher rates accelerate training but risk instability; lower rates improve precision but slow convergence. Batch size affects stability and memory: larger sizes enhance stability; smaller ones improve generalization. Epochs determine dataset passes: more epochs improve learning but risk overfitting, while fewer reduce training time but may underfit. Momentum accelerates training by smoothing updates, reducing oscillations, and avoiding local minima; higher momentum speeds convergence but risks overshooting, while lower momentum ensures stability. Weight decay (L2 regularization) prevents overfitting by penalizing large weights, promoting simpler models: higher values reduce overfitting but may underfit, while lower values allow flexibility but risk overfitting. Anchor scale adjusts predefined box sizes in object detection: larger scales improve the detection of big objects, while smaller scales enhance accuracy for small objects.

学习率控制权重更新:较高学习率加快训练但可能不稳定;较低学习率提高精度但收敛较慢。批量大小影响稳定性和内存:较大批量提升稳定性;较小批量增强泛化能力。训练轮数(epoch)决定数据集遍历次数:更多轮数提升学习但易过拟合,较少轮数缩短训练时间但可能欠拟合。动量加速训练,通过平滑更新减少振荡,避免陷入局部最优;较高动量加快收敛但可能超调,较低动量保证稳定。权重衰减(L2正则化)通过惩罚大权重防止过拟合,促进模型简化:较高值减少过拟合但可能欠拟合,较低值灵活但易过拟合。锚框尺度调整目标检测中预定义框大小:较大尺度提升大目标检测,较小尺度提高小目标准确率。
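
As a concrete illustration of the hyperparameters discussed above, the hedged PyTorch sketch below wires a learning rate, momentum, weight decay, batch size, and epoch count into a standard SGD training setup; all values, the toy model, and the synthetic data are placeholders for illustration, not tuned settings from any of the reviewed papers.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Illustrative hyperparameters (placeholders, not values recommended by the survey).
LEARNING_RATE = 1e-3   # higher -> faster but less stable updates
MOMENTUM = 0.9         # smooths gradient updates, reduces oscillation
WEIGHT_DECAY = 5e-4    # L2 regularization against overfitting
BATCH_SIZE = 32        # larger -> more stable gradients, more memory
EPOCHS = 100           # more passes -> better fit, higher overfitting risk

# Toy classifier and synthetic data standing in for a traffic-scene model/dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 8))
data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 8, (256,)))
loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)

optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE,
                      momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate on a fixed schedule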

In the reviewed Fast R-CNN models, [22] used SGD with mini-batches of 2 images, 128 RoIs, a learning rate of 0.001 decayed after 30k iterations, momentum of 0.9, and weight decay of 0.0005. Reference [26] employed a 0.001 learning rate, a batch size of 128, and trained for 100 epochs.

在所评述的Fast R-CNN模型中,[22]使用了小批量为2张图像、128个感兴趣区域(RoIs)、学习率为0.001且在30k次迭代后衰减,动量为0.9,权重衰减为0.0005。参考文献[26]采用学习率0.001,批量大小128,训练100个epoch。

For Faster R-CNN, [34] modified anchor scales (80², 112², 144²) and aspect ratios (0.4, 0.8, 2.5), maintaining nine anchors per position. Reference [28] used a learning rate of 0.001, a batch size of 32, 100 epochs, momentum of 0.9, and weight decay of 0.0005.

对于Faster R-CNN,[34]调整了锚框尺度(802,11221442)和长宽比(0.4,0.8,2.5),保持每个位置九个锚框。参考文献[28]使用学习率0.001,批量大小32,训练100个epoch,动量0.9,权重衰减0.0005。

In Mask R-CNN, [51] set the learning rate to 0.02, weight decay to 0.0001, momentum to 0.9, and the batch size to 16, and trained for 900k iterations. Reference [57] used a learning rate of 0.001, a batch size of 1, and trained for 50 epochs.

在Mask R-CNN中,[51]设置学习率为0.02,权重衰减为0.0001,动量为0.9,批量大小16,训练900k次迭代。参考文献[57]使用学习率0.001,批量大小1,训练50个epoch。

For YOLO models, [121] used a learning rate of 0.001 decayed by 0.1 per epoch, momentum of 0.9, weight decay of 0.0005, 8000 max_batches, and a batch size of 32. Reference [122] used a learning rate of 0.01, a final rate of 0.2, momentum of 0.937, weight decay of 0.0005, 110 epochs, and a batch size of 12. Reference [123] utilized a learning rate of 0.0002, the Adam optimizer with β₁ = 0.5 and β₂ = 0.999, and a batch size of 32.

对于YOLO模型,[121]使用学习率0.001,每个epoch衰减0.1,动量0.9,权重衰减0.0005,最大批次数8000,批量大小32。参考文献[122]采用学习率0.01,最终学习率0.2,动量0.937,权重衰减0.0005,训练110个epoch,批量大小12。参考文献[123]使用学习率0.0002,Adam优化器,β1=0.5,β2=0.999,批量大小32。

Among ViT models, [89] used 16×16 patches, a 768-dimensional embedding, 12 layers, 12 attention heads, the Adam optimizer with a 0.001 learning rate, and 100 epochs. Reference [91] used 32×32 patches, 12 attention heads, a 768-dimensional embedding, and Rectified Adam, training for up to 1000 epochs with early stopping. Reference [92] used 16×16 patches, 6 layers, 8 attention heads, and trained with a 0.0002 learning rate over 80 epochs.

在ViT模型中,[89]使用16×16个补丁,768维嵌入,12层,12个注意力头,Adam优化器,学习率0.001,训练100个epoch。参考文献[91]使用32×32个补丁,12个注意力头,768维嵌入,Rectified Adam优化器,训练最多1000个epoch并采用早停。参考文献[92]使用16×16个补丁,6层,8个注意力头,学习率0.0002,训练80个epoch。

For DETR models, [94] used ResNet-50/101 backbones, a transformer with 6 encoder/decoder layers, 256 hidden dimensions, 8 attention heads, the AdamW optimizer with a \( 10^{-4} \) learning rate, a batch size of 16, and 300 epochs. Reference [95] used ResNet-50, the AdamW optimizer, a 0.0001 learning rate, a batch size of 4, 300 query positions, and data augmentation. Reference [96] used ResNet-50, AdamW, a \( 10^{-4} \) learning rate, a batch size of 8, learning-rate reduction on plateau, and data augmentation. Reference [97] used ResNet-50, AdamW, a \( 10^{-4} \) learning rate, a batch size of 32, and 50 epochs with scheduled rate reductions.

对于DETR模型,[94]使用了ResNet-50/101主干网络,包含6层编码器/解码器的变换器,256维隐藏层,8个注意力头,AdamW优化器,学习率为104,批量大小为16,训练300个周期。参考文献[95]使用ResNet-50,AdamW优化器,学习率0.0001,批量大小4,300个查询位置,并进行了数据增强。参考文献[96]使用ResNet-50,AdamW,学习率为104,批量大小8,采用了平台调整学习率和数据增强。参考文献[97]使用ResNet-50,AdamW,学习率为104,批量大小32,训练50个周期并采用了计划性学习率衰减。

I. COMPARISON OF DISCRIMINATIVE DL ARCHITECTURES

一、判别式深度学习架构比较

Table 2 compares various discriminative DL architectures. In classification, models are evaluated by overall accuracy for traffic scene recognition tasks. The Receptive Field NN [6], an early DL application for road sign classification, achieved a modest 47.7% accuracy on a custom dataset ("RFNN_TSR"), illustrating early limitations. More recent CNN-based methods, like 2LConvNet ms 108-108 [7], and CapsNet methods, such as LiuCaps [120], reached 97.83% and 98.72% accuracy, respectively, on another traffic sign dataset.

表2比较了各种判别式深度学习架构。在分类任务中,模型通过整体准确率评估交通场景识别性能。感受野神经网络(Receptive Field NN)[6]作为早期用于道路标志分类的深度学习应用,在自定义数据集(“RFNN_TSR”)上取得了47.7%的准确率,反映了早期的局限性。更新的基于卷积神经网络(CNN)的方法,如2LConvNet ms 108-108 [7],以及胶囊网络(CapsNet)方法,如LiuCaps [120],分别在另一交通标志数据集上达到了97.83%和98.72%的准确率。

For object detection, Table 2 highlights the performance improvement of Fast R-CNN over standard R-CNN: AllLightRCNN [32] achieves 94.20% mean accuracy on the AllLightRCNN_DS dataset, a 16.4% gain over the R-CNN baseline's 77.8% mean accuracy on the same dataset. Mask R-CNN and Faster R-CNN achieved mAPs of 74.30% and 76.30%, respectively [57]. Faster R-CNN models on COCO 2017 achieved AP scores from 40.2% to 44.0%, with Faster R-CNN-FPN-R101+ (108 epochs) [97] obtaining the highest AP of 44.0%. The YOLO family (YOLOv1 to YOLOv8) showed strong results, with YOLOv7 [73] achieving 98.77% AP on the ShokriCollection_DS dataset for real-time vehicle detection. YOLOv3 [84] performed well on the PSU dataset for aerial traffic object detection, achieving 96.5% AP and outperforming Faster R-CNN's 73.9%. DETR models also excelled on COCO 2017, with SMCA-R50 (108 epochs) [97] achieving an AP of 45.6%, surpassing Faster R-CNN by 1.6%. Results from YOLO and DETR demonstrate their state-of-the-art performance, unifying bounding box prediction and classification in a single step.

在目标检测方面,表2突出显示了Fast R-CNN相较于标准R-CNN的性能提升,AllLightRCNN [32]在“All-LightRCNN_DS”数据集上实现了94.20%的平均准确率(mAP),比R-CNN的65.76%(“RCNNs_Detection”)提升了16.4%。Mask R-CNN和Faster R-CNN分别达到了74.30%和76.30%的mAP [57]。Faster R-CNN模型在COCO 2017数据集上的AP得分介于40.2%至44.0%之间,其中Faster R-CNN-FPN-R101+(108个周期)[97]取得了最高的AP为44.0%。YOLO系列(YOLOv1至YOLOv8)表现强劲,YOLOv7 [73]在实时车辆检测的“ShokriCollection_DS”数据集上达到了98.77%的AP。YOLOv3 [84]在PSU数据集的空中交通目标检测中表现优异,AP为96.5%,超过了Faster R-CNN的73.9%。DETR模型在COCO-2017上同样表现出色,SMCA-R50(108个周期)[97]实现了45.6%的AP,超过Faster R-CNN 1.6%。YOLO和DETR的结果展示了其先进性能,实现了边界框预测与分类的一步统一。

For segmentation tasks, the CNN-based SNE-RoadSeg [8] achieved 98.6% accuracy on the R2D dataset, demonstrating high performance for road segmentation. Real-world applications include flood segmentation, where Mask R-CNN [52] achieved 93.0% accuracy on the IDRF dataset [58]. CapsNet-based methods, like U-Net [118], achieved an IoU of 74.61% on the AHI dataset for vehicle-related scene segmentation, benefiting from capsules' ability to preserve spatial hierarchies for robust segmentation.

在分割任务中,基于CNN的SNE-RoadSeg [8]在R2D数据集上达到了98.6%的准确率,展示了道路分割的高性能。实际应用包括洪水分割,Mask R-CNN [52]在IDRF数据集[58]上实现了93.0%的准确率。基于胶囊网络的方法,如U-Net [118],在AHI数据集的车辆相关场景分割中取得了74.61%的交并比(IoU),得益于胶囊网络保持空间层次结构以实现稳健分割的能力。

In traffic action recognition, CNN-based methods achieved modest results, with CPM [102] reaching 63.98% accuracy on the TPGR dataset. GCN-based models, such as ST-GCN [101] and Pose GCN [102], significantly outperformed CNNs with accuracies of 87.72% and 97.52%, respectively, due to their ability to capture spatial-temporal relationships and structural representations.

在交通动作识别中,基于CNN的方法取得了适中结果,CPM [102]在TPGR数据集上达到了63.98%的准确率。基于图卷积网络(GCN)的模型,如Pose GCN [102]和ST-GCN [101],分别以87.72%和97.52%的准确率显著超越了CNN,得益于其捕捉时空关系和结构表示的能力。

TABLE 2. Comparison of discriminative DL models for traffic scene understanding across applications, frameworks, datasets, metrics, and results.

表2. 交通场景理解中判别式深度学习模型在应用、框架、数据集、指标和结果方面的比较。

Application | Framework | Variance | Dataset | Performance Metric | Result
Classification | Vanilla CNN | Receptive Field NN [6] | RFNN_TSR | Accuracy | 47.7%
 | Multi-scale CNN | 2LConvNet ms 108-108 [7] | TL_Dataset | Accuracy | 97.83%
 | CNN | ResNet-50 [111] | 1043-syn | Accuracy | 90.53%
 | GCN | MR-GCN [100] | KITTI | Accuracy | 89%
 | GCN | HetEdgeGCN [109] | GAT_SCENE | Accuracy | 93.54%
 | GCN | HetEdgeGatedGCN [109] | GAT_SCENE | Accuracy | 90.09%
 | GCN | MRGCN [111] | 1043-syn | Accuracy | 95.80%
 | GAT | HetEdgeGAT [109] | GAT_SCENE | Accuracy | 94.29%
 | GIN | MRGIN [111] | 1043-syn | Accuracy | 87.84%
 | CapsNet | ImprovedCaps [119] | GTSRB | Accuracy | 96%
 | CapsNet | LiuCaps [120] | TL_Dataset | Accuracy | 98.72%
Object Detection | CNN | ResNet50 [57] | RCNNs_Detection | mAP | 65.76%
 | R-CNN | VGG16 [10] | VOC2007 | mAP | 66.0%
 | R-CNN | ZF, VGG16 [32] | AllLightRCNN_DS | Mean Accuracy | 77.8%
 | Fast R-CNN | AllLightRCNN [32] | AllLightRCNN_DS | Mean Accuracy | 94.20%
 | Mask R-CNN | ME Mask R-CNN [54] | TrainObstacle | mAP | 91.3%
 | Mask R-CNN | Mask R-CNN [57] | RCNNs_Detection | mAP | 74.30%
 | Faster R-CNN | Faster R-CNN [57] | RCNNs_Detection | mAP | 76.30%
 | Faster R-CNN | ResNet-50 [73] | ShokriCollection_DS | AP | 54.69%
 | Faster R-CNN | Inception v2 [84] | PSU | AP | 73.9%
 | Faster R-CNN | Faster R-CNN-FPN-R50 (36 epochs) [97] | COCO 2017 | AP | 40.2%
 | Faster R-CNN | Faster R-CNN-FPN-R50++ (108 epochs) [97] | COCO 2017 | AP | 42.0%
 | Faster R-CNN | Faster R-CNN-FPN-R101 (36 epochs) [97] | COCO 2017 | AP | 42.0%
 | Faster R-CNN | Faster R-CNN-FPN-R101+ (108 epochs) [97] | COCO 2017 | AP | 44.0%
 | YOLOv1 | A custom CNN [63] | LISA-dayTrain | AUC | 58.3%
 | YOLOv2 | Darknet-19 [63] | LISA-dayTrain | AUC | 60.05%
 | YOLOv3 | Darknet-53 [63] | LISA-dayTrain | AUC | 90.49%
 | YOLOv3 | Darknet-53 [84] | PSU | AP | 96.5%
 | YOLOv4 | CSPDarknet53-PANet-SPP [84] | PSU | AP | 96.5%
 | YOLOv5 | Modified CSPDarknet53 [73] | ShokriCollection_DS | AP | 93.85%
 | YOLOv6 | EfficientRep [73] | ShokriCollection_DS | AP | 92.95%
 | YOLOv7 | No pretrained backbone [73] | ShokriCollection_DS | AP | 98.77%
 | YOLOv8 | A CSPDarknet variant [73] | ShokriCollection_DS | AP | 91.23%
 | ViT | Vanilla ViT [89] | ViT_DS | F1-score | 92.10%
 | ViT | ViT-SSA [89] | ViT_DS | F1-score | 98.07%
 | ViT | ViT-TA [91] | DAD | F1-score | 94%
 | DETR | DSRA-DETR [95] | CCTSDB | AP | 78.24%
 | DETR | MTSDet [96] | CTSD | mAP | 94.3%
 | DETR | DETR-R50 (500 epochs) [97] | COCO 2017 | AP | 42.0%
 | DETR | DETR-DC5-R50 (500 epochs) [97] | COCO 2017 | AP | 43.3%
 | DETR | Deformable DETR-R50, Single-scale [97] | COCO 2017 | AP | 39.7%
 | DETR | Deformable DETR-R50 (150 epochs) [97] | COCO 2017 | AP | 45.3%
 | DETR | UP-DETR-R50 (150 epochs) [97] | COCO 2017 | AP | 40.5%
 | DETR | UP-DETR-R50+ (300 epochs) [97] | COCO 2017 | AP | 42.8%
 | DETR | SMCA-R50 (108 epochs) [97] | COCO 2017 | AP | 45.6%
 | DETR | DETR-R101 (500 epochs) [97] | COCO 2017 | AP | 43.5%
 | DETR | DETR-DC5-R101 (500 epochs) [97] | COCO 2017 | AP | 44.9%
 | DETR | SMCA-R101 (50 epochs) [97] | COCO 2017 | AP | 44.4%
 | DETR | DetectFormer [98] | BCTSDB | AP75 | 91.4%
 | GAT | APSEGAT [107] | AMLPR | F-Score | 90%
 | CapsNet | TSDCaps [117] | GTSRB | Accuracy | 97.62%
Segmentation | CNN | SNE-RoadSeg [8] | R2D | Accuracy | 98.6%
 | Mask R-CNN | Mask R-CNN [52] | IDRF [58] | Accuracy | 93.0%
 | CapsNet | U-Net [118] | AHI | IoU | 74.61%
Action Recognition | CNN | CPM [102] | TPGR | Accuracy | 63.98%
 | ViT | Action-ViT [90] | JAAD | F1-score | 90.2%
 | GCN | ST-GCN [101] | CTPG | Accuracy | 87.72%
 | GCN | Pose GCN [102] | TPGR | Accuracy | 97.52%
 | GCN | OpenPose [103] | TPGR | Accuracy | 87.72%
 | GCN | DA-GCN [104] | TPGR | Accuracy | 94.70%
Object Tracking | YOLOv3 | Deep SORT [108] | Pets-mf | MOTA | 92.08%
 | GAT | GAM tracker [108] | Pets-mf | MOTA | 94.99%
Path Prediction | GIN | Pishgu [112] | ActEV/VIRAT | ADE, FDE | 14.11, 27.96
 | GIN | CARPe [115] | ETH | ADE, FDE | 0.80, 1.48
Scene Retrieval | GIN | VF2 Without Optimization [113] | RSG | Matching Time (s) | 0-5,000
 | GIN | VF2 With Optimization [113] | RSG | Matching Time (s) | 0-2.5
 | GIN | GNN-based Matching [113] | RSG | Matching Time (s) | 0-2.0
Novel Scenario Detection | ViT | ViT-L [92] | Wurst_DS [93] | AUC | 95.6%
 | GIN | Expert-LaSTS [114] | OpenStreetMap | AUC | 99.1%
应用框架方差数据集性能指标结果
分类基础卷积神经网络(Vanilla CNN)感受野神经网络(Receptive Field NN)[6]RFNN_TSR准确率47.7%
多尺度卷积神经网络(Multi-scale CNN)2层卷积网络 ms 108-108 [7]TL_数据集准确率97.83%
卷积神经网络(CNN)ResNet-50 [111]1043-合成准确率90.53%
图卷积网络(GCN)MR-GCN [100]KITTI准确率89%
图卷积网络(GCN)异构边图卷积网络(HetEdgeGCN)[109]GAT_场景准确率93.54%
图卷积网络(GCN)异构边门控图卷积网络(HetEdgeGatedGCN)[109]GAT_场景准确率90.09%
图卷积网络(GCN)MRGCN [111]1043-合成准确率95.80%
图注意力网络(GAT)异构边图注意力网络(HetEdgeGAT)[109]GAT_场景准确率94.29%
图同构网络(GIN)MRGIN [111]1043-合成准确率87.84%
胶囊网络(CapsNet)改进胶囊网络(ImprovedCaps)[119]德国交通标志识别基准(GTSRB)准确率96%
胶囊网络(CapsNet)刘氏胶囊网络(LiuCaps)[120]TL_数据集准确率98.72%
目标检测卷积神经网络(CNN)ResNet50 [57]RCNNs_检测平均精度均值(mAP)65.76%
区域卷积神经网络(R-CNN)VGG16 [10]VOC2007平均精度均值(mAP)66.0%
区域卷积神经网络(R-CNN)ZF,VGG16 [32]AllLightRCNN_数据集平均准确率77.8%
快速区域卷积神经网络(Fast R-CNN)AllLightRCNN [32]AllLightRCNN_数据集平均准确率94.20%
掩码区域卷积神经网络(Mask R-CNN)ME 掩码区域卷积神经网络(ME Mask R-CNN)[54]火车障碍物平均精度均值(mAP)91.3%
掩码区域卷积神经网络(Mask R-CNN)掩码区域卷积神经网络(Mask R-CNN)[57]RCNNs_检测平均精度均值(mAP)74.30%
更快区域卷积神经网络(Faster R-CNN)更快区域卷积神经网络(Faster R-CNN)[57]RCNNs_检测平均精度均值(mAP)76.30%
更快区域卷积神经网络(Faster R-CNN)ResNet-50 [73]Shokri集合_数据集平均精度(AP)54.69%
更快区域卷积神经网络(Faster R-CNN)Inception v2 [84]宾夕法尼亚州立大学(PSU)平均精度(AP)73.9%
更快区域卷积神经网络(Faster R-CNN)更快区域卷积神经网络-FPN-R50(36轮)[97]COCO 2017平均精度(AP)40.2%
更快区域卷积神经网络(Faster R-CNN)更快区域卷积神经网络-FPN-R50++(108轮)[97]COCO 2017平均精度(AP)42.0%
更快区域卷积神经网络(Faster R-CNN)更快区域卷积神经网络-FPN-R101(36轮)[97]COCO 2017平均精度(AP)42.0%
更快区域卷积神经网络(Faster R-CNN)更快区域卷积神经网络-FPN-R101+(108轮)[97]COCO 2017平均精度(AP)44.0%
YOLOv1自定义卷积神经网络 [63]LISA-白天训练曲线下面积(AUC)58.3%
YOLOv2Darknet-19 [63]LISA-白天训练曲线下面积(AUC)60.05%
YOLOv3Darknet-53 [63]LISA-白天训练曲线下面积(AUC)90.49%
YOLOv3Darknet-53 [84]宾夕法尼亚州立大学(PSU)平均精度(AP)96.5%
YOLOv4CSPDarknet53-PANet-SPP [84]宾夕法尼亚州立大学(PSU)平均精度(AP)96.5%
YOLOv5改进的 CSPDarknet53 [73]Shokri集合_数据集平均精度(AP)93.85%
YOLOv6EfficientRep [73]Shokri集合_数据集平均精度(AP)92.95%
YOLOv7无预训练主干网络 [73]Shokri集合_数据集平均精度(AP)98.77%
YOLOv8CSPDarknet 变体 [73]Shokri集合_数据集平均精度(AP)91.23%
ViT原版 ViT (Vision Transformer) [89]ViT_DSF1分数92.10%
ViTViT-SSA [89]ViT DSF1分数98.07%
ViTViT-TA [91]DADF1分数94%
DETRDSRA-DETR [95]CCTSDB平均精度(AP)78.24%
DETRMTSDet [96]CTSD平均精度均值(mAP)94.3%
DETRDETR-R50(500轮)[97]COCO 2017平均精度(AP)42.0%
DETRDETR-DC5-R50(500轮)[97]COCO 2017平均精度(AP)43.3%
DETR可变形 DETR-R50,单尺度 [97]COCO 2017平均精度(AP)39.7%
DETR可变形 DETR-R50(150轮)[97]COCO 2017平均精度(AP)45.3%
DETRUP-DETR-R50(150轮)[97]COCO 2017平均精度(AP)40.5%
DETRUP-DETR-R50+(300轮)[97]COCO 2017平均精度(AP)42.8%
DETRSMCA-R50(108轮)[97]COCO 2017平均精度(AP)45.6%
DETRDETR-R101(500轮)[97]COCO 2017平均精度(AP)43.5%
DETRDETR-DC5-R101(500轮)[97]COCO 2017平均精度(AP)44.9%
DETRSMCA-R101(50轮)[97]COCO 2017平均精度(AP)44.4%
DETRDetectFormer [98]BCTSDBAP7591.4%
图注意力网络(GAT)APSEGAT [107]AMLPRF分数90%
胶囊网络(CapsNet)TSDCaps [117]德国交通标志识别基准(GTSRB)准确率97.62%
分割卷积神经网络(CNN)SNE-RoadSeg [8]R2D准确率98.6%
Mask-R-CNNMask-R-CNN [52]IDRF [58]准确率93.0%
胶囊网络(CapsNet)U-Net [118]AHI交并比 (IoU)74.61%
动作识别卷积神经网络(CNN)CPM [102]TPGR准确率63.98%
ViTAction-ViT [90]JAADF1分数90.2%
图卷积网络(GCN)ST-GCN [101]CTPG准确率87.72%
图卷积网络(GCN)Pose GCN [102]TPGR准确率97.52%
图卷积网络(GCN)OpenPose [103]TPGR准确率87.72%
图卷积网络(GCN)DA-GCN [104]TPGR准确率94.70%
目标跟踪YOLOv3Deep SORT [108]Pets-mfMOTA(多目标跟踪准确率)92.08%
图注意力网络(GAT)GAM 跟踪器 [108]Pets-mfMOTA(多目标跟踪准确率)94.99%
路径预测图同构网络(GIN)Pishgu [112]ActEV/VIRATADE(平均位移误差),FDE(最终位移误差)14.11, 27.96
图同构网络(GIN)CARPe [115]ETHADE(平均位移误差),FDE(最终位移误差)0.80,1.48
场景检索图同构网络(GIN)无优化的 VF2 算法 [113]RSG匹配时间(秒)0-5,000
图同构网络(GIN)优化后的 VF2 算法 [113]RSG匹配时间(秒)0-2.5
图同构网络(GIN)基于图神经网络(GNN)的匹配 [113]RSG匹配时间(秒)0-2.0
新颖场景检测ViTViT-L [92]Wurst_DS [93]曲线下面积(AUC)95.6%
图同构网络(GIN)Expert-LaSTS [114]OpenStreetMap(开放街图)曲线下面积(AUC)99.1%

For object tracking, Deep SORT with YOLOv3 [108] achieved a MOTA of 92.08% on the Pets-mf dataset, while the GAM tracker [108] improved MOTA to 94.99%, demonstrating the impact of attention mechanisms. In scene retrieval, the GIN-based approaches in [113] cut matching time on the RSG dataset from up to 5,000 s for unoptimized VF2 to under 2.5 s with optimized VF2 and under 2.0 s with GNN-based matching. For novel scenario detection, ViT-L [92] and Expert-LaSTS [114] achieved AUCs of 95.6% and 99.1%, respectively, effectively identifying unusual traffic scenarios.

在目标跟踪方面,结合YOLOv3的Deep SORT [108] 在Pets-mf数据集上实现了92.08%的MOTA,而GAM跟踪器 [108] 将MOTA提升至94.99%,展示了注意力机制的影响。在场景检索中,基于GIN的模型如VF2和基于GNN的匹配 [113] 实现了100%的准确率,检索时间为0-2秒。对于新颖场景检测,ViT-L [92] 和Expert-LaSTS [114] 分别达到了95.6%和99.1%的AUC,有效识别异常交通场景。

IV. GENERATIVE MACHINE LEARNING MODELS

四、生成式机器学习模型

Generative machine learning models are growing increasingly integral in advancing DL for traffic scene understanding. Unlike discriminative models that differentiate between distinct entities, generative models excel in generating new data instances that mimic real-world scenarios. These models are adept at creating realistic images and simulations that can be invaluable in traffic scene analysis.

生成式机器学习模型在推动深度学习(DL)用于交通场景理解方面日益重要。与区分式模型区分不同实体不同,生成式模型擅长生成模拟真实场景的新数据实例。这些模型能够创造逼真的图像和仿真,对于交通场景分析极具价值。

In traffic scene understanding, generative models find applications in synthesizing diverse and complex traffic scenarios for training and evaluation purposes. They can generate varied environmental conditions, and lighting variations, including rare traffic occurrences, offering a robust and comprehensive dataset for training discriminative models. This enhances the ability of DNNs to interpret and respond accurately to dynamic traffic situations. Additionally, generative models can be used in anomaly detection, where they help identify unusual or hazardous traffic conditions by contrasting them with the normative patterns they have learned.

在交通场景理解中,生成式模型应用于合成多样且复杂的交通场景,用于训练和评估。它们可以生成多变的环境条件和光照变化,包括罕见的交通事件,提供强大且全面的数据集以训练区分式模型,提升深度神经网络(DNN)准确解读和响应动态交通状况的能力。此外,生成式模型还可用于异常检测,通过与其学习的正常模式对比,帮助识别异常或危险的交通状况。

The following sections discuss generative ML models shaping traffic scene understanding, from basic GANs to complex hybrids blending generative and discriminative techniques. These models address data generation, realism enhancement, and scenario simulation, advancing DL in intelligent transportation systems. Additionally, we explore HPO for these architectures and evaluate their performance metrics, offering a comprehensive overview.

以下章节讨论塑造交通场景理解的生成式机器学习模型,从基础的生成对抗网络(GAN)到融合生成与区分技术的复杂混合模型。这些模型解决数据生成、真实感增强和场景仿真问题,推动智能交通系统中的深度学习发展。同时,我们探讨这些架构的超参数优化(HPO)并评估其性能指标,提供全面概述。

A. GAN

A. 生成对抗网络(GAN)

A GAN, as introduced first in [124], consists of two NNs, a generator and a discriminator, which are trained simultaneously through adversarial training. The generator creates new data instances that resemble a given dataset, while the discriminator evaluates them for authenticity. This process leads the generator to produce increasingly realistic data samples.

生成对抗网络(GAN)最早由文献[124]提出,由两个神经网络组成:生成器和判别器,通过对抗训练同时优化。生成器创造与给定数据集相似的新数据实例,判别器评估其真实性。该过程促使生成器产出越来越逼真的数据样本。

Figure 11 illustrates the training process of a GAN for application to a traffic scene understanding problem. The generator (G) creates fake data samples from random noise. It can be represented as a function \( G: z \mapsto \hat{x} \), where \( z \) is a random noise vector sampled from a simple probability distribution (such as a Gaussian distribution), and \( \hat{x} \) is the generated data.

图11展示了GAN在交通场景理解问题中的训练过程。生成器(G)从随机噪声生成假数据样本。其可表示为函数G:zx^,其中z是从简单概率分布(如高斯分布)采样的随机噪声向量,x^是生成的数据。

The discriminator (D) evaluates the authenticity of a given data sample, determining whether it is real (from the actual dataset) or fake (generated by the generator). It can be represented as a function \( D: x \mapsto [0, 1] \), where \( x \) is a data sample. \( D(x) \) represents the probability that \( x \) comes from the real dataset (output close to 1) rather than being generated (output close to 0).

判别器(D)评估给定数据样本的真实性,判断其是真实数据集中的样本还是生成器生成的假样本。其可表示为函数D:x[0,1],其中x是数据样本,D(x)表示该样本来自真实数据集的概率(输出接近1),而非生成数据(输出接近0)。

The training of a GAN is formulated as optimizing the following value function, also known as the minimax loss:

GAN的训练目标是最小化以下值函数,也称为极小极大损失:

\( \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \)  (37)

where \( \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] \) represents the expected value of the log-probability that the discriminator correctly classifies real data, and \( \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \) represents the expected value of the log-probability that the discriminator correctly classifies generated data as fake.

其中Expdata (x)[logD(x)]表示判别器正确分类真实数据的对数概率的期望值,Ezpz(z)[log(1D(G(z)))]表示判别器正确将生成数据分类为假样本的对数概率的期望值。

This objective creates a dynamic similar to a tug-of-war between the generator and the discriminator. The generator aims to minimize this objective, while the discriminator seeks to maximize it.

该目标形成了生成器与判别器之间类似拔河的动态。生成器旨在最小化该目标,而判别器则试图最大化它。

In the training process, the generator and discriminator are updated iteratively. The generator learns to produce more realistic data to fool the discriminator, while the discriminator improves its ability to differentiate between real and fake data.

在训练过程中,生成器和判别器交替更新。生成器学习生成更逼真的数据以欺骗判别器,判别器则提升区分真实与假数据的能力。

This adversarial process continues until the generated data is indistinguishable from real data, or until a stopping criterion is met.

该对抗过程持续进行,直到生成数据与真实数据无法区分,或达到停止条件。
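
A minimal sketch of this alternating update is given below, assuming simple fully connected generator and discriminator stand-ins for a traffic-scene GAN; all layer sizes, learning rates, and names are illustrative and not taken from any of the reviewed models.

import torch
from torch import nn, optim

LATENT_DIM, IMG_DIM = 100, 64 * 64 * 3  # illustrative sizes, not from any cited model

G = nn.Sequential(nn.Linear(LATENT_DIM, 512), nn.ReLU(), nn.Linear(512, IMG_DIM), nn.Tanh())
D = nn.Sequential(nn.Linear(IMG_DIM, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1), nn.Sigmoid())

opt_g = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0 (Equation 37).
    z = torch.randn(batch, LATENT_DIM)
    fake = G(z).detach()
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator (non-saturating objective).
    z = torch.randn(batch, LATENT_DIM)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Hypothetical usage with a random stand-in for flattened traffic-scene images.
print(train_step(torch.rand(16, IMG_DIM) * 2 - 1))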

GANs are utilized for spatio-temporal traffic state reconstruction [125], enhancing video frame predictions [126] and aiding semantic segmentation [127] for autonomous vehicles, augmenting training data for rare events [128], synthesizing soiling on fisheye camera images [129], improving highway traffic images [130] and road segmentation [131] in adverse weather, and augmenting data to improve classifier generalization [132].

GAN被用于时空交通状态重建 [125]、提升视频帧预测 [126]、辅助自动驾驶车辆的语义分割 [127]、增强罕见事件训练数据 [128]、合成鱼眼摄像头图像污渍 [129]、改善恶劣天气下的高速公路交通图像 [130]及道路分割 [131],并通过数据增强提升分类器泛化能力 [132]。

SoPhie [133], an innovative GAN-based framework, addresses path prediction for interacting agents in autonomous scenarios by integrating physical and social information through a novel combination of social and physical attention mechanisms. It achieves ADE and FDE scores of 0.70 and 1.43, respectively, on the ETH dataset, setting a new standard in trajectory forecasting benchmarks for self-driving applications.

SoPhie [133] 是一个基于生成对抗网络(GAN)的创新框架,通过新颖结合社交和物理注意力机制,融合物理与社交信息,解决自动驾驶场景中交互代理的路径预测关键任务,在ETH数据集上分别实现了0.70和1.43的显著平均位移误差(ADE)和最终位移误差(FDE)分数,树立了自动驾驶车辆轨迹预测基准的新标杆。

Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are fundamental metrics extensively used in the field of image and video quality assessment. These metrics are crucial for quantifying the fidelity and visual quality of images and videos by comparing them to original, uncompressed, or distortion-free versions.

峰值信噪比(PSNR)和结构相似性指数(SSIM)是图像和视频质量评估领域广泛使用的基本指标。这些指标通过将图像和视频与原始、无压缩或无失真版本进行比较,量化其保真度和视觉质量,具有重要意义。
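
As a quick illustration of how these two metrics are typically computed in practice, the sketch below implements PSNR with NumPy and uses scikit-image for SSIM; the random arrays merely stand in for an enhanced traffic image and its reference, and the setup assumes scikit-image is available.

import numpy as np
from skimage.metrics import structural_similarity

def psnr(reference, test, data_range=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between a reference and a test image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10((data_range ** 2) / mse)

# Stand-in images: a 'reference' frame and a noisy 'enhanced' version of it.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)
noise = rng.integers(-10, 10, size=(128, 128))
test = np.clip(reference.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print("PSNR (dB):", psnr(reference, test))
print("SSIM:", structural_similarity(reference, test, data_range=255))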

FIGURE 11. The training process of a GAN for a traffic scene understanding problem: The generator (G) creates synthetic traffic scenes from a random noise vector \( z \), sampled from a latent distribution such as a Gaussian distribution. The generated scene \( \hat{x} \) is then evaluated by a discriminator (D), which also receives real samples from the training dataset. The discriminator's task is to classify the samples as 'fake' or 'real'. Through this adversarial process, the generator iteratively improves its ability to produce scenes that are indistinguishable from real-world traffic scenes.

图11. 用于交通场景理解问题的生成对抗网络(GAN)训练过程:生成器(G)从潜在分布(如高斯分布)采样的随机噪声向量z生成合成交通场景。该生成场景x^随后由判别器(D)评估,判别器同时接收训练数据集中的真实样本。判别器的任务是将样本分类为“伪造”或“真实”。通过这一对抗过程,生成器迭代提升其生成与真实交通场景难以区分的能力。

The TSR-GAN model proposed in [125] effectively mines and estimates traffic correlations and patterns, setting a new benchmark for spatio-temporal traffic state reconstruction. In comprehensive comparisons, TSR-GAN excels by achieving the highest traffic state similarity (TSS), formulated as \( \mathrm{TSS} = (\mathrm{PSNR} + \mathrm{MSSIM} \times 100)/2 \), with a score of 32.595. Additionally, it yields the lowest errors, including a root mean square error (RMSE) of 6.585, a mean absolute error (MAE) of 5.205, and a mean absolute percentage error (MAPE) of 8.671%. These results surpass models such as GASM, CED, SRGAN, and its variations, demonstrating TSR-GAN's superior precision and versatility in reconstructing traffic states under diverse conditions.

文献[125]提出的TSR-GAN模型有效挖掘并估计交通相关性和模式,在时空交通状态重建方面树立了新基准。在全面比较中,TSR-GAN以32.595的交通状态相似度(TSS)得分表现卓越。此外,其误差指标最低,包括均方根误差(RMSE)6.585、平均绝对误差(MAE)5.205和平均绝对百分比误差(MAPE)8.671%,优于GASM、CED、SRGAN及其变体,展现了TSR-GAN在多样条件下重建交通状态的卓越精度和适应性。

The study in [126] evaluates the effectiveness of GAN-based enhancement methods, specifically SRGAN [134] and DeblurGAN [135], in refining video frame predictions made by another generative model, FutureGAN [136], to significantly improve object detection for autonomous vehicles, demonstrating a notable 9% increase in AP for car detection when using the enhanced frames.

文献[126]评估了基于GAN的增强方法,特别是SRGAN [134]和DeblurGAN [135],在提升另一生成模型FutureGAN [136]的视频帧预测质量方面的有效性,显著提高了自动驾驶车辆目标检测的性能,增强帧在车辆检测的平均精度(AP)上表现出显著提升9%

A modified CycleGAN [137] introduced in [128] effectively demonstrates the use of GANs for augmenting training data for rare events in autonomous systems, achieving an improvement in mAP for perception tasks from 44.5% to 45.5%. This indicates a credible approach to enhancing the robustness of object detection and scenario classification, while also tackling the issue of training data scarcity and proposing methods to reduce GAN-induced bias. The dataset of this work (referred to as "RareEvents_DS" in our work), collected over two months on California highways, includes 8 hours of driving data, 3959 pixel-wise annotated images, and 600 event-annotated video clips.

文献[128]中引入的改进型CycleGAN [137]有效展示了GAN在增强自动驾驶系统中罕见事件训练数据的应用,使感知任务的平均精度均值(mAP)从44.5%提升至45.5%。这表明该方法在提升目标检测和场景分类的鲁棒性方面具有可信性,同时解决了训练数据稀缺问题并提出减少GAN引入偏差的方法。该研究数据集(在本工作中称为“RareEvents_DS”)采集于加州高速公路,包含8小时驾驶数据、3959张像素级标注图像及600个事件标注视频片段。

The authors in [129] propose two algorithms for soiling synthesis on fisheye camera images. The first is a CycleGAN-based baseline [137], and the second is DirtyGAN. Both algorithms deliver comparable end-to-end results. DirtyGAN, a GAN-based approach, improves soiling detection by 18% and increases the mean IoU (mIoU) to 91.71% by training on a combination of real and synthetic images. This approach mitigates semantic segmentation degradation caused by soiled data, eliminates manual annotation costs by automatically generating soiling masks, and introduces the Dirty Cityscapes dataset, leveraging the original Cityscapes dataset.

文献[129]提出了两种鱼眼相机图像污渍合成算法。第一种为基于CycleGAN的基线方法[137],第二种为DirtyGAN。两种算法在端到端结果上表现相当。基于GAN的DirtyGAN通过结合真实与合成图像训练,将污渍检测提升了18%,并将平均交并比(mIoU)提高至91.71%。该方法缓解了污渍数据导致的语义分割性能下降,自动生成污渍掩码,消除了人工标注成本,并引入了基于原始Cityscapes数据集的Dirty Cityscapes数据集。

The study in [130] introduces a highly effective highway traffic image enhancement algorithm for adverse weather conditions, achieving remarkable performance gains of 21.97% and 12.89% in nighttime enhancement, 26.16% and 12.75% in rain removal, and 26.56% and 12.1% in fog removal for PSNR and SSIM metrics respectively, showcasing its superior capability in detail retention and noise reduction.

文献[130]提出了一种针对恶劣天气条件下高速公路交通图像的高效增强算法,在夜间增强中分别实现了21.97%和12.89%的PSNR和SSIM显著提升,雨天去除中分别提升26.16%和12.75%,雾天去除中分别提升26.56%和12.1%,展现了其在细节保留和噪声抑制方面的卓越能力。

In [127] (referred to as "MTPanClass" in our work), a model is proposed to refine the segmentation of target main bodies by leveraging the pan-class intrinsic relevance among multiple targets. This approach includes a novel use of generative adversarial learning, which integrates intrinsic relevance features with semantic features to enhance segmentation. MTPanClass achieves mIoU scores of 49.8%, 64.6%, 76.01%, and 89.3% on the ADE20K, PASCAL-Context, KITTI, and Cityscapes datasets, respectively, demonstrating strong performance and adaptability to complex scene contexts.

在文献[127]中(在我们的工作中称为“MTPanClass”),提出了一种模型,通过利用多个目标之间的泛类内在相关性来细化目标主体的分割。该方法创新性地采用了生成对抗学习,将内在相关性特征与语义特征相结合以增强分割效果。MTPanClass在ADE20K、PASCALContext、KITTI和Cityscapes数据集上分别实现了49.8%、64.6%,76.01%和89.3%的mIoU分数,展示了其在复杂场景上下文中的强大性能和适应性。

IEC-Net, presented in [131], is an image enhancement network based on CycleGAN [137], specifically designed to improve road segmentation under diverse weather conditions. When tested on the Cityscapes dataset under severe weather scenarios, IEC-Net achieved an mIoU of 89.3%, showcasing significant improvements in segmentation accuracy when integrated with state-of-the-art segmentation models.

IEC-Net在文献[131]中提出,是一种基于CycleGAN [137]的图像增强网络,专门设计用于提升多种天气条件下的道路分割效果。在Cityscapes数据集的恶劣天气场景测试中,IEC-Net实现了89.3%的mIoU,显示出与最先进分割模型结合时分割精度的显著提升。

The AttGAN model proposed in [138] is utilized in [132] to introduce a novel data augmentation approach. This method leverages attribute-conditioned generative models to semantically modify training data, significantly enhancing the generalization capabilities of deep classifiers across varying times of day and weather conditions. Notably, this approach achieved an F1-score of 86% on the BDD dataset for Semantic DA when trained using original day images along with synthetic night images.

文献[138]中提出的AttGAN模型被用于文献[132]中引入一种新颖的数据增强方法。该方法利用属性条件生成模型对训练数据进行语义修改,显著提升了深度分类器在不同时间和天气条件下的泛化能力。值得注意的是,该方法在BDD数据集的语义域适应任务中,使用原始白天图像和合成夜间图像训练时,达到了86%的F1分数。

GANs are highly effective for generating realistic data, including high-quality images and other forms of synthetic content. However, they are notoriously difficult to train, often facing challenges such as instability and mode collapse, where the model produces limited variations of the data. Successful training of GANs requires meticulous tuning of hyperparameters and network architecture, as well as access to large datasets. These factors make GANs computationally intensive and difficult to scale, especially for complex or high-resolution tasks.

生成对抗网络(GAN)在生成逼真数据(包括高质量图像及其他合成内容)方面非常有效。然而,GAN训练过程 notoriously 困难,常面临不稳定和模式崩溃(mode collapse)等挑战,即模型生成的数据变异性有限。成功训练GAN需要精细调整超参数和网络结构,并依赖大规模数据集。这些因素使得GAN计算资源消耗大且难以扩展,尤其是在处理复杂或高分辨率任务时。

B. cGAN

B. 条件生成对抗网络(cGAN)

A cGAN is an extension of the standard GAN that allows for the conditional generation of data based on input labels or information. The cGAN model comprises two neural networks, a generator \( G \) and a discriminator \( D \), that are trained simultaneously. The primary difference between cGAN and GAN is that in cGAN, both the generator and discriminator receive additional information \( y \) as input, which is often the label or condition for the data generation process.

条件生成对抗网络(cGAN)是标准GAN的扩展,允许基于输入标签或信息进行条件数据生成。cGAN模型包含两个神经网络——生成器G和判别器D,二者同时训练。cGAN与GAN的主要区别在于,生成器和判别器均接收额外信息y作为输入,通常是数据生成过程的标签或条件。

The generator \( G \) takes as input both a random noise vector \( z \) and a condition \( y \), and it generates a sample \( \tilde{x} = G(z, y) \) that attempts to resemble the real data conditioned on \( y \).

生成器G输入随机噪声向量z和条件y,生成一个样本x~=G(z,y),试图使其在条件y下与真实数据相似。

The discriminator \( D \) takes as input both a data sample \( x \) (or a generated sample \( \tilde{x} \)) and the condition \( y \), and it outputs a probability \( D(x, y) \) (or \( D(\tilde{x}, y) \)) that the sample is real (i.e., from the training data) rather than generated by \( G \):

判别器D输入数据样本x(或生成样本x~)及条件y,输出样本为真实(即来自训练数据)而非由G生成的概率D(x,y)(或D(x~,y)):

\( D(x, y) = P(\text{real} \mid x, y) \)  (38)

\( D(\tilde{x}, y) = P(\text{real} \mid \tilde{x}, y). \)  (39)

The objective functions for the generator and discriminator are derived from the original GAN framework but are conditioned on \( y \).

生成器和判别器的目标函数源自原始GAN框架,但均以y为条件。

The discriminator tries to maximize the probability of correctly classifying the real data and minimize the probability of incorrectly classifying the generated data. The loss function for the discriminator is:

判别器试图最大化正确分类真实数据的概率,同时最小化错误分类生成数据的概率。判别器的损失函数为:

\( \mathcal{L}_D = \mathbb{E}_{x, y \sim p_{\text{data}}(x, y)}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z),\, y \sim p_{\text{data}}(y)}[\log(1 - D(G(z, y), y))] \)  (40)

The generator tries to minimize the probability that the discriminator correctly distinguishes between real and generated data. The loss function for the generator is:

生成器试图最小化判别器正确区分真实与生成数据的概率。生成器的损失函数为:

\( \mathcal{L}_G = -\mathbb{E}_{z \sim p_z(z),\, y \sim p_{\text{data}}(y)}[\log D(G(z, y), y)]. \)  (41)

In practice, G and D engage in a minimax game. This is the same as described in Equation 37 for GAN.

在实际中,GD进行极小极大博弈。这与GAN中方程37所述相同。

The training process alternates between updating the discriminator \( D \) by maximizing \( \mathcal{L}_D \) with respect to the parameters of \( D \), and updating the generator \( G \) by minimizing \( \mathcal{L}_G \) with respect to the parameters of \( G \).

训练过程交替进行:通过最大化LD更新判别器D的参数,随后通过最小化LG更新生成器G的参数。
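
The only structural change relative to the unconditional GAN sketch given earlier is that both networks receive the condition \( y \). The hedged illustration below uses one-hot class conditioning; all sizes, class counts, and names are ours, not from the reviewed cGAN models.

import torch
from torch import nn

LATENT_DIM, NUM_CLASSES, IMG_DIM = 100, 8, 64 * 64 * 3  # illustrative sizes

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + NUM_CLASSES, 512), nn.ReLU(),
                                 nn.Linear(512, IMG_DIM), nn.Tanh())

    def forward(self, z, y_onehot):
        # Condition the generator by concatenating y to the noise vector z.
        return self.net(torch.cat([z, y_onehot], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(IMG_DIM + NUM_CLASSES, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, x, y_onehot):
        # D(x, y): probability that x is a real sample under condition y.
        return self.net(torch.cat([x, y_onehot], dim=1))

# Hypothetical usage: generate scenes conditioned on a traffic-object class label.
G, D = CondGenerator(), CondDiscriminator()
y = nn.functional.one_hot(torch.randint(0, NUM_CLASSES, (4,)), NUM_CLASSES).float()
x_fake = G(torch.randn(4, LATENT_DIM), y)
print(D(x_fake, y).shape)  # torch.Size([4, 1])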

The introduction of the Two-Stream Conditional Generative Adversarial Network (TScGAN) in [139] significantly improves mIoU scores across various state-of-the-art CNN-based semantic segmentation models, with increases such as 77.2% to 79.0% for DeepLabV3 and 81.6% to 83.6% for HRNet. TScGAN enhances both segmentation accuracy and processing speed by addressing higher-order inconsistencies in semantic segmentation and effectively utilizing dual input streams to preserve high-level contextual information. These improvements are particularly evident when applied to smaller image sizes (e.g., 512 × 512) on datasets like Cityscapes.

文献[139]中提出的双流条件生成对抗网络(TScGAN)显著提升了多种最先进基于CNN的语义分割模型的mIoU分数,例如DeepLabV3从77.2%提升至79.0%,HRNet从81.6%提升至83.6%。TScGAN通过解决语义分割中的高阶不一致性并有效利用双输入流保留高级上下文信息,提升了分割精度和处理速度。这些改进在处理较小图像尺寸(如512×512)且应用于Cityscapes等数据集时尤为明显。

C. VAE

C. 变分自编码器(VAE)

Variational Autoencoders, as introduced in [140], mark a significant advancement in generative modeling by combining deep learning with variational inference. Their core innovation lies in the use of a latent variable model to effectively capture complex data distributions. This approach provides a robust framework for approximating these distributions. VAEs are trained using SGD, which ensures efficient optimization and training. This methodology not only enhances the model's generative capabilities but also aids in uncovering the underlying structure of the data, making VAEs highly versatile in generative modeling tasks.

变分自编码器(Variational Autoencoders,VAE),如文献[140]所述,通过将深度学习与变分推断相结合,在生成建模领域取得了重大进展。其核心创新在于使用潜变量模型有效捕捉复杂的数据分布。这种方法为近似这些分布提供了一个稳健的框架。VAE采用随机梯度下降(SGD)进行训练,确保了优化和训练的高效性。该方法不仅提升了模型的生成能力,还帮助揭示数据的潜在结构,使VAE在生成建模任务中具有高度的通用性。

Figure 12 illustrates the training process of a VAE applied to a traffic scene reconstruction problem. VAEs are based on a latent variable model:

图12展示了将VAE应用于交通场景重建问题的训练过程。VAE基于潜变量模型:

\( p(x, z) = p(x \mid z)\, p(z), \)  (42)

where \( x \) is observed data, \( z \) is a latent variable, \( p(z) \) is the prior over latent variables, and \( p(x \mid z) \) is the likelihood.

其中x为观测数据,z为潜变量,p(z)为潜变量的先验分布,p(xz)为似然函数。

The goal is to infer the posterior distribution \( p(z \mid x) \), which is typically intractable. VAEs introduce a variational approximation \( q_\phi(z \mid x) \) to this posterior, where \( \phi \) are parameters learned by the model.

目标是推断后验分布p(zx),该分布通常是不可解的。VAE引入了变分近似qϕ(zx)来逼近该后验,其中ϕ是模型学习的参数。

FIGURE 12. Training of a VAE for traffic scene reconstruction: An observed input image \( x \) is passed through the encoder \( q_\phi(z \mid x) \), which approximates the posterior distribution \( p(z \mid x) \). The encoder produces the parameters of the latent variable \( z \), specifically the mean \( \mu_{z \mid x} \) and covariance \( \Sigma_{z \mid x} \), representing the latent distribution as \( z \sim \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x}) \). This latent variable is then used by the decoder \( p_\phi(x \mid z) \) to generate a reconstructed image \( \hat{x} \), with the aim of making \( \hat{x} \) as close as possible to the original \( x \). The training objective involves maximizing the ELBO on the marginal likelihood \( p(x) \), which is composed of a reconstruction loss (to reduce the difference between \( x \) and \( \hat{x} \)) and a KL divergence term \( D_{KL}(q_\phi(z \mid x) \,\|\, p(z)) \) (to align the learned latent distribution with a standard normal prior). The reparameterization trick, represented by \( z = g_\phi(\epsilon, x) \), where \( \epsilon \) is a noise variable, allows efficient backpropagation through the latent space. This process ensures effective training of the VAE, resulting in high-quality, realistic reconstructions of traffic scenes.

图12. 用于交通场景重建的VAE训练:观测输入图像x通过编码器qϕ(zx),该编码器近似后验分布p(zx)。编码器输出潜变量z的参数,具体为均值μzx和协方差zx,将潜变量分布表示为zN(μzx,zx)。然后,解码器pϕ(xz)利用该潜变量生成重建图像x^,目标是使x^尽可能接近原始图像x。训练目标包括最大化边际似然的证据下界(ELBO)p(x),该目标由重建损失(减少xx^之间的差异)和KL散度项DKL(qϕ(zx)p(z))(使学习的潜变量分布与标准正态先验对齐)组成。重参数技巧由z=gϕ(ϵ,x)表示,其中ϵ为噪声变量,允许在潜空间中高效反向传播。该过程确保了VAE的有效训练,生成高质量且逼真的交通场景重建图像。

The training of VAEs involves maximizing the Evidence Lower Bound (ELBO), denoted \( f_{\mathrm{ELBO}} \), on the marginal likelihood \( p(x) \). The \( f_{\mathrm{ELBO}} \) is given by:

VAE的训练涉及最大化边际似然的证据下界(ELBO)(fELBO)。该ELBO定义为:

\( f_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p(x \mid z)] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right), \)  (43)

where \( D_{KL} \) denotes the KL divergence.

其中DKL表示KL散度。

To enable gradient-based optimization, VAEs use the reparameterization trick, which allows the model to backpropagate through random nodes. If \( z \sim q_\phi(z \mid x) \), it is reparameterized as:

为了实现基于梯度的优化,VAE采用重参数技巧,使模型能够通过随机节点进行反向传播。如果zqϕ(zx),则其重参数化为:

\( z = f_{\mathrm{rep},\phi}(\epsilon, x), \)  (44)

where \( f_{\mathrm{rep},\phi}(\epsilon, x) \) is a deterministic function that transforms an auxiliary noise variable \( \epsilon \) and the input \( x \) to generate the latent variable \( z \). Typically, \( f_{\mathrm{rep},\phi}(\epsilon, x) \) is defined such that:

其中frep ,ϕ(ϵ,x)是一个确定性函数,将辅助噪声变量ϵ和输入x转换生成潜变量z。通常,frep ,ϕ(ϵ,x)定义为:

\( f_{\mathrm{rep},\phi}(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \)  (45)

where \( \mu_\phi(x) \) and \( \sigma_\phi(x) \) are outputs of the encoder network that represent the mean and standard deviation, respectively, \( \epsilon \sim \mathcal{N}(0, I) \) is a standard normal noise variable, and \( \odot \) denotes the element-wise (Hadamard) product.

其中μϕ(x)σϕ(x)是编码器网络的输出,分别表示均值和标准差,ϵN(0,I)是标准正态噪声变量,表示逐元素(Hadamard)乘积。

In practice, VAEs are implemented using NNs. The encoder network approximates \( q_\phi(z \mid x) \) and produces the parameters of the latent distribution, while the decoder network models \( p(x \mid z) \).

在实际应用中,变分自编码器(VAE)通常通过神经网络(NNs)实现。编码器网络近似qϕ(zx)并生成潜在分布的参数,而解码器网络则对p(xz)进行建模。
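
A minimal sketch of such an encoder/decoder pair, with the reparameterization trick of Equation 45 and the negative ELBO of Equation 43 as the training loss, is shown below; the layer sizes and names are illustrative and do not correspond to any specific reviewed model.

import torch
from torch import nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, img_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(img_dim, 256)
        self.mu, self.logvar = nn.Linear(256, latent_dim), nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization (Eq. 45)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Negative ELBO (Eq. 43): reconstruction term + KL(q_phi(z|x) || N(0, I)).
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Hypothetical usage on stand-in traffic-scene crops flattened to vectors.
model = TinyVAE()
x = torch.rand(8, 64 * 64)
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar).item())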

VAEs are essential in traffic scene analysis, excelling in unsupervised tasks like data generation, denoising, and feature extraction. They create realistic and adversarial scenarios to enhance automated driving systems' robustness and are crucial for anomaly detection, boosting driving safety and efficiency. Applications include improving TLD [141], segmenting navigable spaces [142], detecting out-of-distribution (OOD) images in multi-label datasets [143], generating realistic traffic scenes [144], detecting adversarial driving scenes [145], and detecting traffic anomalies [146].

变分自编码器(VAE)在交通场景分析中至关重要,擅长无监督任务如数据生成、去噪和特征提取。它们通过创建逼真且对抗性的场景,提升自动驾驶系统的鲁棒性,并在异常检测中发挥关键作用,增强驾驶安全性和效率。应用包括提升交通灯检测(TLD)[141]、可通行空间分割[142]、多标签数据集中检测分布外(OOD)图像[143]、生成逼真交通场景[144]、检测对抗性驾驶场景[145]及交通异常检测[146]。

VATLD [141] adapts a state-of-the-art β-VAE [147] with additional regularization terms for prediction and perceptual loss, achieving a significant improvement in TLD accuracy over the base model, SSD MobileNet V1 [43], on the BSTLD dataset. Notably, this improvement is most evident for red and yellow lights, reflected by an increase in overall AP@IoU50 from 0.478 to 0.493, albeit with a slight decrease in accuracy for green lights.

VATLD [141]基于最先进的β-VAE [147],引入预测和感知损失的额外正则项,在BSTLD数据集上相比基础模型SSD MobileNet V1 [43]显著提升了交通灯检测(TLD)准确率。尤其在红灯和黄灯检测上表现突出,整体准确率(OA)在IoU 50的平均精度(AP@IoU50)从0.478提升至0.493,尽管绿灯准确率略有下降。

NSS-VAE [142] is a dual-VAE architecture that excels in unsupervised segmentation of navigable spaces, surpassing 90% accuracy on the KITTI road benchmark. It outperforms traditional supervised methods, especially where ground truth labels are scarce. By merging deep features with GCNs to manage boundary uncertainties, NSS-VAE shows strong potential for autonomous navigation.

NSS-VAE [142]是一种双重VAE架构,擅长无监督的可通行空间分割,在KITTI道路基准测试中准确率超过90%。其性能优于传统的监督方法,尤其在缺乏真实标签的情况下表现突出。通过将深度特征与图卷积网络(GCNs)结合以处理边界不确定性,NSS-VAE展现出自动导航的强大潜力。

The approach in [143], based on β-VAE [147], efficiently detects out-of-distribution (OOD) images in multi-label datasets, which is critical for safe autonomous operations such as end-to-end driving. By leveraging compact latent spaces to represent variations in key generative factors, this method offers a cost-effective solution for OOD detection. Evaluation on the nuScenes dataset shows detection rates of 95% for time-of-day, 74% for traffic, and 100% for pedestrian partitions, highlighting its robustness in complex real-world scenarios.

[143]中的方法基于β-VAE [147],高效检测多标签数据集中的分布外(OOD)图像,这对于端到端驾驶等安全自动驾驶操作至关重要。该方法利用紧凑的潜在空间表示关键生成因子的变化,提供了一种成本效益高的OOD检测方案。在nuScenes数据集上的评估显示,时间段检测率为95%,交通检测率为74%,行人分区检测率达到100%,体现了其在复杂真实场景中的鲁棒性。

SceneGen [144] presents a neural autoregressive model for traffic scenes, generating new examples and evaluating existing ones without rules or heuristics, providing a flexible, scalable way to model real-world traffic complexity. It demonstrates significant realism improvements, with the lowest Negative Log-Likelihood (NLL) of 59.86 and an enhanced detection AP from 85.9% (using LayoutVAE) to 90.4% on the ATG4D dataset.

SceneGen [144]提出了一种用于交通场景的神经自回归模型,能够生成新样本并评估现有样本,无需规则或启发式方法,提供了一种灵活且可扩展的方式来建模真实交通的复杂性。该模型显著提升了逼真度,负对数似然(NLL)最低为59.86,ATG4D数据集上的检测平均精度(AP)从使用LayoutVAE的85.9%提升至90.4%。

A tree-structured VAE (T-VAE) for Semantically Adversarial Generation (SAG), designed to detect adversarial driving scenes in 3D point cloud segmentation models while adhering to traffic rules, is presented in [145]. This method enhances the robustness of automated driving systems by embedding traffic signs at the neuron level and leveraging explicit domain knowledge of object properties and relationships. The T-VAE-SAG approach demonstrates controllable and explainable generation, significantly reducing semantic constraint violations while maintaining data diversity. It achieves a reconstruction error (RE) of 14.5 ± 1.3 for targets from the Semantic KITTI dataset, even with random initialization.

[145]提出了一种用于语义对抗生成(SAG)的树结构变分自编码器(T-VAE),旨在检测3D点云分割模型中的对抗驾驶场景,同时遵守交通规则。该方法通过在神经元层面嵌入交通标志,并利用对象属性及其关系的显式领域知识,增强了自动驾驶系统的鲁棒性。T-VAE-SAG方法实现了可控且可解释的生成,显著减少了语义约束违规,同时保持数据多样性。在Semantic KITTI数据集上的目标重建误差(RE)达到14.5± 1.3,即使在随机初始化情况下亦表现优异。

In [146], an attention-based VAE (A-VAE) with 2D CNN and BiLSTM layers improved on a Recurrent VAE for anomaly detection on the UCSD dataset. The Recurrent VAE achieved 90.4% AUC with a 15.8% equal error rate (EER), while the A-VAE achieved 91.7% AUC with an 18.2% EER.

[146]中提出了一种基于注意力机制的变分自编码器(A-VAE),结合了二维卷积神经网络(2D CNN)和双向长短时记忆网络(BiLSTM)层,改进了用于UCSD数据集异常检测的循环变分自编码器(Recurrent VAE)。循环VAE实现了90.4%的AUC和15.8%的等错误率(EER),而A-VAE则达到了91.7%的AUC和18.2%的EER。

Clustering-based DA improves traffic scene understanding by clustering data to reveal shared structures, reducing feature discrepancies across weather, camera views, and sensor types. It enhances Person and Vehicle Re-ID by capturing domain-invariant features, with centroid alignment further closing domain gaps, and strengthens multi-object tracking and action recognition through refined temporal and spatial consistency. However, clustering-based DA faces challenges. It requires careful tuning to prevent misalignment in complex scenes, which can degrade performance if clusters capture noise instead of meaningful features. Additionally, the method may struggle in dynamic environments where clusters shift over time, affecting the consistency of multi-object tracking and action recognition. Managing computational demands and scalability also becomes challenging, especially in high-traffic scenarios with extensive data streams.

基于聚类的领域自适应(DA)通过聚类数据揭示共享结构,减少不同天气、摄像机视角和传感器类型之间的特征差异,从而提升交通场景理解。它通过捕捉领域不变特征增强了行人和车辆重识别(Re-ID),质心对齐进一步缩小领域差距,并通过精细的时空一致性强化了多目标跟踪和动作识别。然而,基于聚类的领域自适应面临挑战。它需要精细调参以防止复杂场景中的错位,如果聚类捕获的是噪声而非有意义特征,性能会下降。此外,该方法在动态环境中可能表现不佳,因为聚类随时间变化,影响多目标跟踪和动作识别的一致性。管理计算需求和扩展性也变得困难,尤其是在高流量场景中处理大量数据流时。

D. HPO FOR GENERATIVE MACHINE LEARNING MODELS

生成式机器学习模型的超参数优化(D.HPO)

HPO critically improves generative ML models for traffic scene understanding. By tuning parameters like learning rate, batch size, and latent dimensions in GANs and VAEs, models produce more realistic, diverse synthetic scenarios, enhancing autonomous driving and traffic management algorithms. Optimized models handle rare cases, improve realism, stability, and generalization, prevent mode collapse, and accelerate convergence, thus reducing training time and resources. This ensures robust, efficient, and safe performance under diverse real-world conditions.

超参数优化(HPO)显著提升了用于交通场景理解的生成式机器学习模型。通过调节GAN(生成对抗网络)和VAE(变分自编码器)中的学习率、批量大小和潜在维度等参数,模型能够生成更真实、多样的合成场景,增强自动驾驶和交通管理算法。优化后的模型能处理罕见情况,提高真实感、稳定性和泛化能力,防止模式崩溃,加速收敛,从而减少训练时间和资源消耗。这确保了模型在多样化真实环境下的稳健、高效和安全性能。

Reviewing GAN models, the Adam optimizer was utilized with a learning rate of 0.001 for FutureGAN in [126], gradually reducing the learning rate and using different values for β₁ (0.0 for FutureGAN and 0.5 for DeblurGAN) to optimize model convergence. The authors of [134] trained SRGAN using Adam with an initial learning rate of \( 10^{-4} \), which was reduced to \( 10^{-5} \) after 100,000 iterations, and incorporated VGG loss rescaling to balance the MSE loss. The study in [135] also implemented the Adam optimizer with a learning rate of \( 10^{-4} \), employing a linear decay over 150 epochs. In [136], the Adam optimizer is applied with a starting learning rate of 0.001, decaying it by 0.87 at each resolution step in FutureGAN, and penalty coefficients (\( \lambda = 10 \) and \( \epsilon = 0.001 \)) are used in the WGAN-GP loss for improved stability.

回顾GAN模型,文献[126]中FutureGAN使用Adam优化器,初始学习率为0.001,逐步降低学习率,并针对β1采用不同值(FutureGAN为0.0,DeblurGAN为0.5)以优化模型收敛。文献[134]的SRGAN采用Adam优化器,初始学习率为104,在10万次迭代后降至105,并引入VGG损失重标定以平衡均方误差(MSE)损失。文献[135]同样使用Adam优化器,学习率为104,并在150个epoch内线性衰减。文献[136]中FutureGAN采用Adam优化器,起始学习率为0.001,每个分辨率步骤衰减0.87,并在WGAN-GP损失中使用惩罚系数(λ=10andϵ=0.001)以提升稳定性。

In VAE models, the authors of [148] trained their network using the Adam optimizer with a learning rate of 0.0001, setting the momentums to 0.5 and 0.999. They used a batch size of 1 and defined specific parameter values (\( \lambda_m = 0.01 \), \( \lambda_{\mathrm{adv}} = 1 \), \( \lambda_{\mathrm{recon}} = 10 \)) for their loss functions, training the model for 250,000 iterations. The authors of [143] also employed the Adam optimizer but with a much lower learning rate of \( 1 \times 10^{-5} \). They performed a hyperparameter search over β values ranging from 1.0 to 1.9 and latent dimensions (\( n_{\mathrm{Latent}} \)) from 5 to 30, ultimately finding that a β value in combination with \( n_{\mathrm{Latent}} = 30 \) yielded the best results. Their model architecture included an encoder with convolutional layers and a symmetrical decoder, and they trained their model for 100 epochs.

在VAE模型中,文献[148]使用Adam优化器训练网络,学习率为0.0001,动量参数设为0.5和0.999,批量大小为1,并为损失函数定义了特定参数值(λm=0.01λadv =1,λrecon =10,训练迭代次数为25万次。文献[143]也采用Adam优化器,但学习率更低,为1×105。他们对β参数进行了超参数搜索,范围从1.0到1.9,潜在维度(nLatent)从5到30,最终发现β值与n潜在=30的组合效果最佳。其模型架构包括带卷积层的编码器和对称解码器,训练周期为100个epoch。

For cGANs, TScGAN [139] was trained using a learning rate of 0.0005 and a batch size of 32 across 150 training epochs. These training settings led to notable improvements in mean mIoU scores across different segmentation models, with DeepLabV3 achieving an increase from 77.2 to 79.0 and HRNet improving from 81.6 to 83.6.

对于条件生成对抗网络(cGANs),TScGAN[139]采用学习率0.0005,批量大小32,训练150个epoch。这些训练设置显著提升了不同分割模型的平均交并比(mIoU)得分,DeepLabV3从77.2提升至79.0,HRNet从81.6提升至83.6。

E. COMPARISON OF GENERATIVE MACHINE LEARNING MODELS

E. 生成式机器学习模型比较

A comparison of different categories of generative ML models is presented in Table 3. For the classification section, when applied to the Cityscapes dataset, CycleGAN [129] improved to an mIoU of 78.20%, while DirtyGAN [129] achieved a higher mIoU of 91.71%. Meanwhile, the AttGAN [132] model achieved an F1-score of 96% for car classification when using synthetic snowy data generated by AttGAN, compared to a score of 91% for the classifier trained without the synthetic data. For segmentation, GAN-based models have also demonstrated strong performance, with CycleGAN [131] achieving an mIoU of 71.6% on the Cityscapes dataset for urban segmentation under severe weather conditions.

表3展示了不同类别生成式机器学习模型的比较。在分类部分,应用于Cityscapes数据集时,Cycle-GAN[129]的mIoU提升至78.20%,而Dirty-GAN[129]则达到更高的91.71%。同时,AttGAN[132]模型在使用AttGAN生成的合成雪天数据进行汽车分类时,F1分数达到96%,相比之下未使用合成数据训练的分类器得分为91%。在分割任务中,基于GAN的模型也表现出强劲性能,CycleGAN[131]在CityScapes数据集的恶劣天气城市分割中实现了71.6%的mIoU。

GANs have also been applied for traffic image enhancement. For the Cityscapes dataset, multiple GAN methods were evaluated, with FutureGAN [126] achieving a PSNR of 22.38 and an SSIM of 0.61. Compared to these results, DeblurGAN [126] obtains a PSNR of 21.95 and an SSIM of 0.59, while SRGAN [126] had a lower PSNR of 20.49 and SSIM of 0.49. On the RainDegraded dataset, DCGAN [130] achieved a PSNR of 24.98 and SSIM of 0.81, with ImprovedGAN [130] further improving these metrics to 25.81 and 0.84, respectively. Finally, on the FogDegraded (RESIDE) dataset, DeblurGAN [130] achieved a PSNR of 24.42 and SSIM of 0.81, while ImprovedGAN [130] achieved 25.79 and 0.88, respectively.

生成对抗网络(GANs)也被应用于交通图像增强。在Cityscapes数据集上,评估了多种GAN方法,其中FutureGAN [126] 达到了22.38的峰值信噪比(PSNR)和0.61的结构相似性指数(SSIM)。相比之下,DeblurGAN [126] 获得了21.95的PSNR和0.59的SSIM,而SRGAN [126] 的PSNR和SSIM分别较低,为20.49和0.49。在RainDegraded数据集上,DCGAN [130] 达到了24.98的PSNR和0.81的SSIM,ImprovedGAN [130] 进一步将这些指标提升至25.81和0.84。最后,在FogDegraded(RESIDE)数据集上,DeblurGAN [130] 实现了24.42的PSNR和0.81的SSIM,而ImprovedGAN [130] 分别达到了25.79和0.88。

The methods presented in the scene generation section in Table 3 are all based on vehicle action recognition. For the ATG4D large-scale traffic scene dataset, LayoutVAE [144] achieved an NLL of 210.80 nats (where "nats" denotes the unit of measurement for NLL when using the natural logarithm). On the same dataset, SceneGen [144] achieved a significantly improved NLL of 59.86 nats. For the Semantic KITTI dataset, T-VAE [145] reported an RE of 135.1 ± 16.9, while T-VAE-SAG [145] achieved the best RE of 14.5 ± 1.3, significantly outperforming both the baseline VAE, with an RE of 110.4 ± 10.6, and T-VAE without SAG. Finally, for anomaly detection on the UCSD dataset, the Recurrent VAE [146] achieved an AUC of 90.4% and an EER of 15.8%. The A-VAE [146] improved the AUC on this dataset to 91.7%, albeit with a higher EER of 18.2%, illustrating the benefit of the attention mechanism for overall detection accuracy in this context.

表3中场景生成部分展示的方法均基于车辆动作识别。对于ATG4D大规模交通场景数据集,LayoutVAE [144] 实现了210.80 nats的负对数似然(NLL)(其中“nats”表示使用自然对数时NLL的计量单位)。在同一数据集上,SceneGen [144] 显著提升了NLL至59.86 nats。对于Semantic KITTI数据集,T-VAE [145] 报告了RE为135.1±16.9,而T-VAE-SAG [145] 实现了最佳RE为14.5 ± 1.3,显著优于基线VAE(RE为110.4±10.6)和不含SAG的T-VAE。最后,在UCSD数据集上的异常检测中,Recurrent VAE [146] 实现了90.4%的AUC和15.8%的EER。A-VAE [146] 在该数据集上进一步提升了这些指标,AUC为91.7%,EER为18.2%,展示了引入注意力机制的优势。

V. DOMAIN ADAPTATION MODELS

五、领域自适应模型

Domain Adaptation (DA) methods are essential for improving traffic scene understanding across diverse environments. Traditional models struggle with distribution differences between training and testing datasets, resulting in poor generalization. DA enables models trained in one domain (e.g., specific weather conditions or regions) to effectively handle new, unseen domains. Unlike earlier approaches relying on hand-crafted features-prone to biases and limited expressiveness-DL-based DA uses DNNs for automatic feature extraction, better capturing complex, high-dimensional relationships.

领域自适应(Domain Adaptation,DA)方法对于提升不同环境下的交通场景理解至关重要。传统模型难以应对训练和测试数据分布差异,导致泛化能力差。DA使得在一个领域(如特定天气条件或地区)训练的模型能够有效处理新的、未见过的领域。与早期依赖易受偏差且表达能力有限的手工特征的方法不同,基于深度学习的DA利用深度神经网络(DNN)自动提取特征,更好地捕捉复杂的高维关系。

While there exist different ways of categorizing deep DA methods, they can broadly be divided into three classes: clustering-based, discrepancy-based, and adversarial-based approaches. Clustering-based methods aim to group target domain data points with similar features to those in the source domain, facilitating knowledge transfer through clustering techniques. Discrepancy-based methods focus on minimizing statistical distances, like Maximum Mean Discrepancy (MMD), between source and target feature distributions for better alignment. Adversarial-based methods use adversarial learning techniques to reduce the gap between source and target domains by training a model to fool a domain discriminator, making features indistinguishable.

虽然深度领域自适应方法有多种分类方式,但大致可分为三类:基于聚类、基于差异和基于对抗的方法。基于聚类的方法旨在将目标域数据点与源域中具有相似特征的数据点分组,通过聚类技术促进知识迁移。基于差异的方法侧重于最小化源域和目标域特征分布之间的统计距离,如最大均值差异(MMD),以实现更好的对齐。基于对抗的方法则利用对抗学习技术,通过训练模型欺骗域判别器,缩小源域和目标域之间的差距,使特征难以区分。

DA models are crucial for traffic scene understanding, allowing DNNs to adapt to varying lighting, weather, and geographic conditions without extensive retraining or large labeled datasets. This flexibility enables a single model to function effectively across diverse regions. By ensuring smooth knowledge transfer, DA models improve accuracy, reliability, and efficiency in traffic analysis and prediction, ultimately making transportation networks safer and more efficient.

领域自适应模型对于交通场景理解至关重要,使深度神经网络能够适应不同的光照、天气和地理条件,无需大量重新训练或标注数据。这种灵活性使单一模型能够在多样化区域中有效运行。通过确保知识的平滑迁移,领域自适应模型提升了交通分析和预测的准确性、可靠性和效率,最终使交通网络更安全、更高效。

In the following sections, we explore the overarching categories of models and techniques in depth. The mechanisms behind each specific adaptation strategy, their real-world applications, and their ability to address data variability and improve model generalization are thoroughly examined. Finally, we also discuss HPO for these strategies and compare their performance metrics to provide a comprehensive overview.

在接下来的章节中,我们将深入探讨各类模型和技术。详细分析每种具体自适应策略的机制、实际应用及其解决数据变异性和提升模型泛化能力的能力。最后,我们还将讨论这些策略的超参数优化(HPO)并比较其性能指标,以提供全面的概览。

A. CLUSTERING-BASED DOMAIN ADAPTATION

A. 基于聚类的领域自适应

Clustering-based DA is a technique that helps a model trained on one domain, the source domain, perform well on another domain, the target domain, by using clustering techniques to identify shared structures between the two domains. The main idea is to group data points from both domains into clusters that capture common characteristics and use these clusters to guide the adaptation process.

基于聚类的领域自适应是一种技术,通过聚类方法识别源域和目标域之间的共享结构,帮助在一个域(源域)训练的模型在另一个域(目标域)上表现良好。其核心思想是将两个域的数据点分组为捕捉共同特征的簇,并利用这些簇指导自适应过程。

Figure 13 depicts the application of clustering-based DA to image classification in a traffic scene. Let \( D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s} \) be the labeled source domain dataset, where \( x_i^s \) is the \( i \)-th feature vector in the source domain and \( y_i^s \) is its corresponding label. Let \( D_t = \{x_j^t\}_{j=1}^{n_t} \) be the unlabeled target domain dataset, where \( x_j^t \) is the \( j \)-th feature vector in the target domain. The task of DA is to adapt a model trained on \( D_s \) so that it performs well on the target domain \( D_t \), even though the target domain has a different data distribution.

图13展示了基于聚类的领域自适应(DA)在交通场景图像分类中的应用。设Ds={(xis,yis)}i=1ns为带标签的源域数据集,其中xis是源域中的第i个特征向量,yis是其对应的标签。设Dt={xjt}j=1nt为无标签的目标域数据集,其中xjt是目标域中的第j个特征向量。领域自适应的任务是使在Ds上训练的模型能够在目标域Dt上表现良好,尽管目标域的数据分布不同。

Clustering-based DA works by grouping both source and target domain data into clusters (or pseudo-labels) and aligning these clusters across domains. The core hypothesis is that the shared clusters between the two domains capture common features that help the model generalize from the source to the target domain.

基于聚类的领域自适应通过将源域和目标域数据分组为聚类(或伪标签),并在域间对齐这些聚类来实现。核心假设是两个域之间共享的聚类捕捉了有助于模型从源域泛化到目标域的共同特征。

Let \( f_{\text{cluster}}(x) \) denote a clustering function that assigns a data point \( x \) to a cluster. We can define two clustering functions: \( f_{\text{cluster},s}(x^s) \) for the source domain data and \( f_{\text{cluster},t}(x^t) \) for the target domain data. A simple clustering approach could use \( k \)-means clustering:

fcluster (x)表示将数据点x分配到某个聚类的聚类函数。我们可以定义两个聚类函数:fcluster ,s(xs)用于源域数据,fcluster ,t(xt)用于目标域数据。一种简单的聚类方法是使用k均值聚类:

\[
f_{\text{cluster}}(x) = \arg\min_{k} \lVert x - \mu_k \rVert^{2}, \tag{46}
\]

where \( \mu_k \) is the centroid of the \( k \)-th cluster and \( \lVert \cdot \rVert \) denotes the L2 norm. The number of clusters, \( K \), is usually the same for both the source and target domains.

其中μk是第k个聚类的质心,表示L2范数。聚类数目K通常在源域和目标域中保持一致。

To adapt between the source and target domains, we want to ensure that the clusters in the source domain align with the clusters in the target domain. This can be formalized as minimizing the difference between the source and target clusters. Specifically, we define a domain alignment loss \( L_{\text{align}} \) based on aligning the centroids of corresponding clusters in the source and target domains:

为了实现源域和目标域之间的适应,我们希望确保源域的聚类与目标域的聚类对齐。这可以形式化为最小化源域和目标域聚类之间的差异。具体地,我们定义了基于对齐源域和目标域对应聚类质心的领域对齐损失Lalign 

\[
L_{\text{align}} = \sum_{i=1}^{K} \lVert \mu_i^s - \mu_i^t \rVert^{2}, \tag{47}
\]

where \( \mu_i^s \) and \( \mu_i^t \) are the centers of cluster \( i \) in the source and target domains, respectively. Minimizing this loss encourages the distributions of the clusters in the source and target domains to be similar.

其中μisμit分别是源域和目标域中第i个聚类的中心。最小化该损失促使源域和目标域聚类的分布相似。

In addition to clustering the target domain data, we can also assign pseudo-labels to the target data based on the clustering. The pseudo-label for a target domain data point \( x_j^t \) is assigned based on its nearest cluster centroid:

除了对目标域数据进行聚类外,我们还可以基于聚类结果为目标数据分配伪标签。目标域数据点xjt的伪标签是根据其最近的聚类质心分配的:

\[
\hat{y}_j^t = \arg\min_{k} \lVert x_j^t - \mu_k^t \rVert^{2}, \tag{48}
\]

TABLE 3. A comprehensive comparison of various generative ML models applied to traffic scene understanding, highlighting the differences in applications, frameworks, variance across models, datasets utilized, performance metrics, and the resulting effectiveness in their respective applications.

表3. 各种生成式机器学习模型在交通场景理解中的综合比较,重点展示了它们在应用、框架、模型间差异、使用的数据集、性能指标及其在各自应用中的效果差异。

| Application | Framework | Variance | Dataset | Performance Metric | Result |
| --- | --- | --- | --- | --- | --- |
| Classification | GAN | CycleGAN [128] | RareEvents_DS | mAP | 45.5% |
|  | GAN | AttGAN [132] | BDD | F1-score | 86% |
| Object Detection | VAE | VATLD [141] | BSTLD | AP@IoU50 | 0.49 |
|  | VAE | MobileNet V1 [43] | BSTLD | AP@IoU50 | 0.48 |
| Segmentation | GAN | MTPanClass [127] | Cityscapes | mIoU | 89.3% |
|  | GAN | CycleGAN [129] | Cityscapes | mIoU | 78.20% |
|  | GAN | DirtyGAN [129] | Cityscapes | mIoU | 91.71% |
|  | GAN | CycleGAN [131] | Cityscapes | mIoU | 71.6% |
|  | GAN | IEC-Net [131] | Cityscapes | mIoU | 89.3% |
|  | cGAN | DeepLabV3+TScGAN [139] | Cityscapes | mIoU | 79.0% |
|  | cGAN | PSPNet+TScGAN [139] | Cityscapes | mIoU | 81.3% |
|  | cGAN | HRNet+TScGAN [139] | Cityscapes | mIoU | 83.6% |
|  | cGAN | HMSA+TScGAN [139] | Cityscapes | mIoU | 86.8% |
|  | VAE | NSS-VAE [142] | KITTI | Accuracy | 90% |
|  | VAE | β-VAE [143] | nuScenes | Detection Rate | 74% |
| Image Enhancement | GAN | FutureGAN [126] | Cityscapes | PSNR, SSIM | 22.38, 0.61 |
|  | GAN | DeblurGAN [126] | Cityscapes | PSNR, SSIM | 21.95, 0.59 |
|  | GAN | SRGAN [126] | Cityscapes | PSNR, SSIM | 20.49, 0.49 |
|  | GAN | DCGAN [130] | RainDegraded | PSNR, SSIM | 24.98, 0.81 |
|  | GAN | ImprovedGAN [130] | RainDegraded | PSNR, SSIM | 25.81, 0.84 |
|  | GAN | DeblurGAN [130] | FogDegraded (RESIDE) | PSNR, SSIM | 24.42, 0.81 |
|  | GAN | ImprovedGAN [130] | FogDegraded (RESIDE) | PSNR, SSIM | 25.79, 0.88 |
| Reconstructing Traffic States | GAN | TSRGAN [125] | NGSIM | TSS | 32.595 |
| Scene Generation | VAE | LayoutVAE [144] | ATG4D | NLL | 210.80 |
|  | VAE | SceneGen [144] | ATG4D | NLL | 59.86 |
|  | VAE | VAE [145] | Semantic KITTI | RE | 110.4 ± 10.6 |
|  | VAE | VAE-WR [145] | Semantic KITTI | RE | 105.9 ± 24.6 |
|  | VAE | GVAE [145] | Semantic KITTI | RE | 123.7 ± 9.5 |
|  | VAE | T-VAE [145] | Semantic KITTI | RE | 135.1 ± 16.9 |
|  | VAE | T-VAE-SAG [145] | Semantic KITTI | RE | 14.5 ± 1.3 |
| Anomaly Detection | VAE | Recurrent VAE [146] | UCSD | AUC, EER | 90.4%, 15.8% |
|  | VAE | A-VAE [146] | UCSD | AUC, EER | 91.7%, 18.2% |
| Path Prediction | GAN | Sophie [133] | ETH | ADE, FDE | 0.70, 1.43 |
应用框架方差数据集性能指标结果
分类生成对抗网络(GAN)CycleGAN [128]RareEvents_DS平均精度均值(mAP)45.5%
生成对抗网络(GAN)AttGAN [132]BDDF1分数86%
目标检测变分自编码器(VAE)VATLD [141]BSTLDAP@IoU500.49
变分自编码器(VAE)MobileNet V1 [43]BSTLDAP@IoU500.48
分割生成对抗网络(GAN)MTPanClass [127]Cityscapes平均交并比(mloU)89.3%
生成对抗网络(GAN)CycleGAN [129]Cityscapes平均交并比(mIoU)78.20%
生成对抗网络(GAN)DirtyGAN [129]Cityscapes平均交并比(mloU)91.71%
生成对抗网络(GAN)CycleGAN [131]CityScapes平均交并比(mloU)71.6%
生成对抗网络(GAN)IEC-Net [131]CityScapes平均交并比(mloU)89.3%
条件生成对抗网络(cGAN)DeepLabV3+TScGAN [139]Cityscapes平均交并比(mIoU)79.0%
条件生成对抗网络(cGAN)PSPNet+TScGAN [139]Cityscapes平均交并比(mloU)81.3%
条件生成对抗网络(cGAN)HRNet+TScGAN [139]Cityscapes平均交并比(mIoU)83.6%
条件生成对抗网络(cGAN)HMSA+TScGAN [139]Cityscapes平均交并比(mIoU)86.8%
变分自编码器(VAE)NSS-VAE [142]KITTI准确率90%
变分自编码器(VAE)\( \beta \) -VAE [143]nuScenes检测率74%
图像增强生成对抗网络(GAN)FutureGAN [126]Cityscapes峰值信噪比(PSNR), 结构相似性指数(SSIM)22.38, 0.61
生成对抗网络(GAN)DeblurGAN [126]Cityscapes峰值信噪比(PSNR), 结构相似性指数(SSIM)21.95, 0.59
生成对抗网络(GAN)SRGAN [126]Cityscapes峰值信噪比(PSNR), 结构相似性指数(SSIM)20.49, 0.49
生成对抗网络(GAN)DCGAN [130]雨天退化峰值信噪比(PSNR), 结构相似性指数(SSIM)24.98, 0.81
生成对抗网络(GAN)ImprovedGAN [130]雨天退化峰值信噪比(PSNR), 结构相似性指数(SSIM)25.81, 0.84
生成对抗网络(GAN)DeblurGAN [130]雾天退化 (RESIDE)峰值信噪比(PSNR), 结构相似性指数(SSIM)24.42, 0.81
生成对抗网络(GAN)ImprovedGAN [130]雾天退化 (RESIDE)峰值信噪比(PSNR), 结构相似性指数(SSIM)25.79, 0.88
交通状态重建生成对抗网络(GAN)TSRGAN [125]NGSIMTSS32.595
场景生成变分自编码器(VAE)LayoutVAE [144]ATG4D负对数似然(NLL)210.80
变分自编码器(VAE)SceneGen [144]ATG4D负对数似然(NLL)59.86
变分自编码器(VAE)变分自编码器(VAE) [145]Semantic KITTIRE\( {110.4} \pm {10.6} \)
变分自编码器(VAE)VAE-WR [145]Semantic KITTIRE\( {105.9} \pm {24.6} \)
变分自编码器(VAE)GVAE [145]Semantic KITTIRE\( {123.7} \pm {9.5} \)
变分自编码器(VAE)T-VAE [145]Semantic KITTIRE\( {135.1} \pm {16.9} \)
变分自编码器(VAE)T-VAE-SAG [145]Semantic KITTIRE\( {14.5} \pm {1.3} \)
异常检测变分自编码器(VAE)循环变分自编码器(Recurrent VAE)[146]加州大学圣地亚哥分校(UCSD)曲线下面积(AUC),等错误率(EER)90.4%, 15.8%
变分自编码器(VAE)A-VAE [146]加州大学圣地亚哥分校(UCSD)曲线下面积(AUC),等错误率(EER)91.7%, 18.2%
路径预测生成对抗网络(GAN)Sophie [133]ETH平均位移误差(ADE),最终位移误差(FDE)0.70, 1.43

where \( \hat{y}_j^t \) represents the pseudo-label assigned to \( x_j^t \). This pseudo-label is then used to refine the model by incorporating target domain data in a semi-supervised manner.

其中 y^jt 表示分配给 xjt 的伪标签。该伪标签随后被用于通过半监督方式结合目标域数据来优化模型。

The objective function in clustering-based DA consists of three parts. The first part is the source domain loss, which is the classification loss on the source domain, and can be a standard supervised learning loss (e.g., cross-entropy):

基于聚类的领域自适应(DA)中的目标函数由三部分组成。第一部分是源域损失,即源域上的分类损失,可以是标准的监督学习损失(例如交叉熵):

\[
L_s = \frac{1}{n_s} \sum_{i=1}^{n_s} \ell\!\left(f_s(x_i^s), y_i^s\right), \tag{49}
\]

where \( f_s(x_i^s) \) is the predicted label for the source data and \( \ell \) is the classification loss function.

其中 fs(xis) 是源数据的预测标签, 是分类损失函数。

The second part is the domain alignment loss, which ensures the alignment between the cluster distributions in the source and target domains. Specifically, we minimize the distance between the cluster centroids in the source and target domains as follows:

第二部分是域对齐损失,确保源域和目标域中聚类分布的一致性。具体来说,我们最小化源域和目标域中聚类中心之间的距离,公式如下:

\[
L_{\text{align}} = \sum_{i=1}^{K} \lVert \mu_i^s - \mu_i^t \rVert^{2}, \tag{50}
\]

where \( \mu_i^s \) and \( \mu_i^t \) represent the centers of cluster \( i \) in the source and target domains, respectively. Minimizing this loss helps align the clusters across domains.

其中 μisμit 分别表示源域和目标域中第 i 个聚类的中心。最小化该损失有助于实现跨域聚类的对齐。

The third part involves using the pseudo-labels from the target domain to compute a classification loss \( L_t \) on the target domain. This pseudo-label loss helps the model learn from the target domain:

第三部分涉及利用目标域的伪标签计算目标域上的分类损失 Lt 。该伪标签损失帮助模型从目标域中学习:

\[
L_t = \frac{1}{n_t} \sum_{j=1}^{n_t} \ell\!\left(f(x_j^t), \hat{y}_j^t\right), \tag{51}
\]

where \( \hat{y}_j^t \) is the pseudo-label for the target domain data point \( x_j^t \). By incorporating pseudo-labels, the model can iteratively refine itself to better predict target domain labels.

其中 y^jt 是目标域数据点 xit 的伪标签。通过引入伪标签,模型可以迭代地自我优化,更好地预测目标域标签。

The total loss function for clustering-based DA is the weighted sum of these three components:

基于聚类的领域自适应的总损失函数是这三部分的加权和:

\[
L = L_s + \lambda L_{\text{align}} + \gamma L_t, \tag{52}
\]

where \( \lambda \) (the domain alignment weight) and \( \gamma \) (the pseudo-labeling weight) are trade-off parameters that control the importance of the domain alignment loss and the target pseudo-labeling loss, respectively, compared to the source classification loss. Specifically, \( \lambda \) controls how much emphasis is placed on aligning the clusters between the source and target domains, while \( \gamma \) adjusts the contribution of learning from pseudo-labels in the target domain.

其中 λ(域对齐权重)和 γ(伪标签权重)是权衡参数,分别控制域对齐损失和目标伪标签损失相对于源分类损失的重要性。具体来说,λ 控制源域与目标域聚类对齐的重视程度,而 γ 调整从目标域伪标签学习的贡献。
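As a concrete illustration of how Equations (46)–(52) fit together, the following is a minimal PyTorch-style sketch of one clustering-based adaptation step; it is not taken from any of the surveyed papers. The `encoder` and `classifier` modules, the batch tensors, and the weights `lam` and `gamma` are assumed placeholders, and clusters are simply taken to be class-conditional feature means (real methods typically run k-means instead).

```python
import torch
import torch.nn.functional as F

def clustering_da_step(encoder, classifier, xs, ys, xt, num_classes,
                       lam=0.5, gamma=0.1):
    """One illustrative optimization step of clustering-based DA (Eq. 46-52).

    Assumes every class appears at least once in the source batch xs/ys.
    """
    zs, zt = encoder(xs), encoder(xt)          # features for source / target batches

    # Source centroids: per-class mean features (Eq. 46 with labels as assignments).
    mu_s = torch.stack([zs[ys == k].mean(dim=0) for k in range(num_classes)])

    # Pseudo-labels for target samples: nearest source centroid (Eq. 48).
    pseudo_t = torch.cdist(zt, mu_s).argmin(dim=1)

    # Target centroids under those assignments; fall back to the source centroid
    # when a cluster receives no target samples in this batch.
    mu_t = torch.stack([
        zt[pseudo_t == k].mean(dim=0) if (pseudo_t == k).any() else mu_s[k]
        for k in range(num_classes)
    ])

    l_s = F.cross_entropy(classifier(zs), ys)            # source loss, Eq. 49
    l_align = ((mu_s - mu_t) ** 2).sum(dim=1).sum()      # centroid alignment, Eq. 47/50
    l_t = F.cross_entropy(classifier(zt), pseudo_t)      # pseudo-label loss, Eq. 51

    return l_s + lam * l_align + gamma * l_t             # total loss, Eq. 52
```

In a full training loop, this loss would simply be backpropagated through `encoder` and `classifier` with a standard optimizer, with the centroids and pseudo-labels recomputed every iteration or every few epochs.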

FIGURE 13. Clustering-based DA for image classification in a traffic scene: A CNN is used to extract features from both a labeled source dataset and an unlabeled target dataset, representing different traffic environments. These features are grouped into clusters for both domains to identify common feature patterns. In the source clusters, blue and orange dots represent distinct clusters (clusters i and j ). Similarly, in the target clusters, blue and orange triangles represent corresponding clusters. The cluster matching and domain alignment step aligns similar clusters from the source and target domains, as indicated by the dashed green line, to ensure that shared features are well-represented across both domains. A classification NN is trained using labeled source data, and the learned weights are shared across both source and target domains. For the source domain, true labels guide training with a source classification loss. The same classification NN with shared weights is used for the target domain, where pseudo-labels are assigned to target data based on cluster alignments. Source output and target output illustrate the predictions made by the classification NN for both domains. These outputs reflect classification decisions such as whether an object is a "Person," "Car," or "Sign." Dashed lines represent the flow of these classification predictions from the NN to their respective outputs. For the source domain, predictions are validated using true labels, whereas for the target domain, pseudo-labels derived from cluster alignment are used. The figure also highlights a potential misclassification in the target domain, shown with a warning symbol, indicating the challenges of aligning the target domain data due to differences from the source domain. The overall clustering-based DA approach helps reduce such errors by adapting shared features between the domains effectively.

图13. 用于交通场景图像分类的基于聚类的领域自适应:使用卷积神经网络(CNN)从带标签的源数据集和无标签的目标数据集中提取特征,这两个数据集代表不同的交通环境。这些特征在两个域中被分组为聚类,以识别共同的特征模式。在源域聚类中,蓝色和橙色点分别代表不同的聚类(聚类 ij)。类似地,在目标域聚类中,蓝色和橙色三角形代表对应的聚类。聚类匹配和域对齐步骤通过虚线绿色线条将源域和目标域中相似的聚类对齐,确保共享特征在两个域中得到良好表示。分类神经网络使用带标签的源数据进行训练,学习到的权重在源域和目标域共享。对于源域,真实标签指导训练并计算源分类损失。相同的共享权重分类神经网络用于目标域,目标数据基于聚类对齐分配伪标签。源输出和目标输出展示了分类神经网络对两个域的预测结果,这些输出反映了诸如“人”、“汽车”或“标志”等分类决策。虚线表示这些分类预测从神经网络流向各自输出的过程。源域的预测通过真实标签验证,而目标域的预测则使用基于聚类对齐的伪标签。图中还突出显示了目标域中可能的误分类,用警告符号标示,表明由于目标域与源域的差异,数据对齐存在挑战。整体基于聚类的领域自适应方法通过有效适应两个域间的共享特征,有助于减少此类错误。

Clustering is performed on both the source and target domain data. The alignment between the clusters in the source and target domains is ensured by minimizing the distance between cluster centers, typically using a domain alignment loss as described above. Additionally, pseudo-labels for the target domain data allow the model to learn directly from the target domain in a semi-supervised manner. The model is trained on the source domain while regularizing with both the alignment loss and the pseudo-label loss to ensure that the model also works well on the target domain.

聚类操作同时在源域和目标域数据上进行。通过最小化聚类中心之间的距离(通常使用上述域对齐损失)来确保源域和目标域聚类的一致性。此外,目标域数据的伪标签使模型能够以半监督方式直接从目标域学习。模型在源域上训练,同时通过对齐损失和伪标签损失进行正则化,以确保模型在目标域上也能表现良好。

Clustering-based DA methods have advanced person reidentification (Person Re-ID) in smart surveillance systems [149], improved pedestrian tracking to enhance urban safety [150], optimized object detection for autonomous driving applications [151], and facilitated semantic segmentation in remote mapping for geographic information systems [152]. Moreover, these methods increase detection reliability across diverse environmental conditions [153].

基于聚类的领域自适应方法推动了智能监控系统中的行人重识别(Person Re-ID)[149],提升了行人跟踪以增强城市安全[150],优化了自动驾驶应用中的目标检测[151],并促进了地理信息系统中遥感地图的语义分割[152]。此外,这些方法提高了在多样环境条件下的检测可靠性[153]。

Contrastive learning [154] enhances robustness against occlusion by teaching models to distinguish similar and dissimilar objects. Treating occluded and unoccluded instances as positive pairs helps learn occlusion-invariant features, reducing reliance on full visibility and enabling recognition even when objects are partially obscured.

对比学习[154]通过教模型区分相似和不相似的对象,增强了对遮挡的鲁棒性。将遮挡和未遮挡的实例视为正样本对,有助于学习遮挡不变特征,减少对完全可见性的依赖,使得即使对象部分被遮挡也能被识别。

In [149], Cluster-based Dual-branch Contrastive Learning (CDCL) tackles data noise and clothing color confusion in unsupervised domain adaptation (UDA) for Person Re-ID. Building on contrastive learning principles [155], CDCL uses partially grayed images and a dual-branch network, achieving 81.5% mAP from DukeMTMC-ReID to Market1501 and improving pseudo-label reliability.

在[149]中,基于聚类的双分支对比学习(Cluster-based Dual-branch Contrastive Learning, CDCL)解决了无监督域适应(UDA)中数据噪声和服装颜色混淆的问题。基于对比学习原理[155],CDCL使用部分灰度图像和双分支网络,实现了从DukeMTMC-ReID到Market1501的81.5% mAP,并提升了伪标签的可靠性。

A deep mutual distillation (DMD) framework for UDA Person Re-ID is introduced, drawing inspiration from the teacher-student paradigm [156]. This framework employs two parallel feature extraction branches that act as teachers for each other, enhancing pseudo-label quality. Combined with a bilateral graph representation to align identity features via visual and attribute consistency, this approach achieves 92.7% mAP from DukeMTMC-reID to Market1501.

提出了一种用于UDA行人重识别的深度互蒸馏(Deep Mutual Distillation, DMD)框架,借鉴了师生范式[156]。该框架采用两个并行的特征提取分支,彼此作为教师,提升伪标签质量。结合双边图表示,通过视觉和属性一致性对身份特征进行对齐,该方法实现了从DukeMTMC-reID到Market1501的92.7% mAP。

In [151], ConfMix addresses UDA in object detection with region-level confidence-based sample mixing. By blending target regions and confident pseudo detections from source images and adding consistency loss, it adapts the model to the target domain. Progressive pseudo-label filtering achieves 52.2% mAP from KITTI to Cityscapes.

在[151]中,ConfMix通过基于区域置信度的样本混合解决了目标检测中的UDA问题。通过融合目标区域和来自源图像的高置信度伪检测结果,并加入一致性损失,使模型适应目标域。渐进式伪标签过滤实现了从KITTI到Cityscapes的52.2% mAP。

Semantic segmentation domain shifts are addressed through adversarial-based DA in FFREEDA (Federated source-Free Domain Adaptation) [152]. Leveraging unlabeled client data with a pre-trained server model, LADD (Learning Across Domains and Devices) employs adversarial self-supervision, ad-hoc regularization, and federated clustered aggregation with cluster-specific classifiers, achieving 40.16±1.02% mIoU from GTA5 to Mapillary.

语义分割的域偏移通过FFREEDA(联邦无源域适应)[152]中的对抗性域适应方法得到解决。利用无标签客户端数据和预训练服务器模型,LADD(跨域与设备学习)采用对抗自监督、特设正则化及带有聚类特定分类器的联邦聚类聚合,实现了从GTA5到Mapillary的40.16±1.02% mIoU。

CFFA, a coarse-to-fine feature adaptation approach for cross-domain object detection, is proposed in [153]. It uses multi-layer adversarial learning for marginal alignment and global prototype matching for conditional alignment. Results include 38.6% mAP from Cityscapes to Foggy Cityscapes, 43.8% AP for Car from SIM10k to Cityscapes, and 41.0% mAP from Cityscapes to KITTI.

在[153]中提出了CFFA,一种用于跨域目标检测的粗到细特征适应方法。它采用多层对抗学习进行边缘对齐,并通过全局原型匹配实现条件对齐。结果包括从Cityscapes到Foggy Cityscapes的38.6% mAP,从SIM10k到Cityscapes的汽车类别43.8% AP,以及从Cityscapes到KITTI的41.0% mAP。

Clustering-based DA aids traffic scene understanding by grouping data into clusters that reveal shared structures, reducing feature differences across varying conditions (weather, camera views, sensors). It improves Person and Vehicle Re-ID by capturing domain-invariant features (body shape, vehicle silhouette). Techniques like centroid alignment and cluster-wise feature matching minimize domain gaps. Clustering-based DA enhances multi-object tracking (refining temporal and spatial consistency) and strengthens action recognition (leveraging contextual relations). It improves cross-domain representation, reduces retraining, and enables scalable performance for autonomous driving and traffic monitoring.

基于聚类的域适应通过将数据分组为揭示共享结构的簇,帮助交通场景理解,减少不同条件(天气、摄像头视角、传感器)下的特征差异。它通过捕捉域不变特征(如人体形态、车辆轮廓)提升行人和车辆重识别性能。质心对齐和簇内特征匹配等技术最小化域间差距。基于聚类的域适应增强了多目标跟踪(优化时空一致性)和动作识别(利用上下文关系),提升跨域表示能力,减少重训练,实现自动驾驶和交通监控的可扩展性能。

B. DISCREPANCY-BASED DOMAIN ADAPTATION

B. 基于差异的域适应

Discrepancy-based DA aims to minimize the difference between source and target domain distributions to transfer knowledge from a labeled source domain to a target domain with limited or no labeled data. The key challenge in this approach is addressing the distribution shift between the probability distributions of the source and target domains. By reducing the discrepancy between these feature distributions, the model trained on the source domain can generalize effectively to the target domain, ensuring better performance despite domain differences.

基于差异的域适应旨在最小化源域和目标域分布之间的差异,将带标签的源域知识迁移到标签有限或无标签的目标域。该方法的关键挑战是解决源域和目标域概率分布的分布偏移。通过减少这些特征分布间的差异,源域训练的模型能够有效泛化到目标域,确保在域差异存在时仍保持良好性能。

Figure 14 illustrates the application of discrepancy-based DA to object detection in a traffic scene under different conditions. Let \( X_s \) and \( Y_s \) represent the input data and labels from the source domain, respectively, and let \( X_t \) be the input data from the target domain. The model \( f_\theta \) is trained to minimize the difference in the distributions between the source and target domains.

图14展示了基于差异的域适应在不同条件下交通场景目标检测中的应用。设XsYs分别为源域的输入数据和标签,Xt为目标域的输入数据。模型fθ通过最小化源域和目标域分布差异进行训练。

The first step is to learn the model on the source domain by minimizing a classification loss \( L_{\text{source}} \), such as cross-entropy:

第一步是在源域上通过最小化分类损失Lsource (如交叉熵)来学习模型:

\[
L_{\text{source}} = \mathbb{E}_{(X_s, Y_s) \sim P(X_s, Y_s)}\!\left[\ell\!\left(f_\theta(X_s), Y_s\right)\right], \tag{53}
\]

where \( \ell \) is the classification loss function (e.g., cross-entropy loss), and \( f_\theta(X_s) \) is the model prediction for the source input data \( X_s \).

其中为分类损失函数(例如交叉熵损失),fθ(Xs)为模型对源域输入dataXs的预测。

Next, the discrepancy between the source and target distributions in the feature space is minimized using a discrepancy distance metric. Common choices include Wasserstein distance, KL divergence, and MMD. Each metric has unique strengths and application scenarios, and understanding their differences is critical to selecting the appropriate tool for DA tasks.

接下来,使用差异距离度量最小化特征空间中源域和目标域分布的差异。常用的度量包括Wasserstein距离、KL散度和最大均值差异(MMD)。每种度量具有独特优势和适用场景,理解它们的差异对于选择合适的域适应工具至关重要。

The Wasserstein distance, also known as the Earth Mover's Distance, dates back to 1781 and was later formalized in a modern optimization framework by [82]. It measures the minimum cost of transporting one probability distribution to match another:

Wasserstein距离,也称为地球搬运者距离(Earth Mover's Distance),起源于1781年,后来由[82]在现代优化框架中形式化。它衡量将一个概率分布转移以匹配另一个概率分布的最小成本:

\[
W\!\left(P(X_s), P(X_t)\right) = \inf_{\gamma \in \Pi\left(P(X_s), P(X_t)\right)} \mathbb{E}_{(X_s, X_t) \sim \gamma}\!\left[\lVert X_s - X_t \rVert\right], \tag{54}
\]

where \( \Pi\left(P(X_s), P(X_t)\right) \) denotes the set of joint distributions with marginals \( P(X_s) \) and \( P(X_t) \), and the infimum is the greatest lower bound of the expected transport cost over all joint distributions \( \gamma \) with these specified marginals. Unlike a strict minimum, the infimum covers cases where the smallest value might not be precisely attainable, but a bound still exists. Compared to other metrics, the Wasserstein distance provides a notion of "how much" one distribution must be transformed to match another, offering a natural and interpretable way to compare distributions. Notably, it is well suited to cases where the supports of the source and target distributions are disjoint, a situation where metrics such as KL divergence fail due to the zero-probability issue.

其中Π(P(Xs),P(Xt))表示边缘分布为P(Xs)P(Xt)的联合分布集合。这里,finf 表示在所有具有指定边缘分布的联合分布γ上的期望成本的下确界(infimum),即最大下界。与严格的最小值不同,下确界允许最小值可能无法精确达到,但存在界限。相比其他度量,Wasserstein距离提供了“需要多少”变换量来匹配两个分布的概念,提供了一种自然且可解释的分布比较方式。值得注意的是,它特别适用于源分布和目标分布支持集不相交的情况,而其他度量如KL散度因零概率问题而失效。

However, the computational cost of calculating the Wasserstein distance is often higher than that of other metrics, as it involves solving a linear programming problem. This restricts its use to smaller datasets and makes it less suitable when computational efficiency is paramount.

然而,计算Wasserstein距离的计算成本通常高于其他度量,因为它涉及求解线性规划问题。这限制了其在数据集较小或计算效率要求较高的场景中的应用。
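For a quick feel of this metric, the snippet below uses SciPy's closed-form one-dimensional estimator on synthetic samples (purely illustrative; the samples stand in for projected feature statistics). Note that the cost stays finite even when the two sample sets barely overlap, unlike KL divergence.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic 1-D "feature" samples standing in for source / target statistics.
source = rng.normal(loc=0.0, scale=1.0, size=5000)
target_near = rng.normal(loc=0.5, scale=1.0, size=5000)   # overlapping support
target_far = rng.normal(loc=8.0, scale=1.0, size=5000)    # essentially disjoint support

print(wasserstein_distance(source, target_near))  # small transport cost (about 0.5)
print(wasserstein_distance(source, target_far))   # large but still finite (about 8.0)
```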

The KL divergence, first introduced in [157], measures the relative entropy between the source and target distributions:

KL散度,最早由[157]提出,衡量源分布与目标分布之间的相对熵:

\[
D_{\mathrm{KL}}\!\left(P(X_s) \,\|\, P(X_t)\right) = \int P(X_s) \log\!\left(\frac{P(X_s)}{P(X_t)}\right) dX, \tag{55}
\]

where \( P(X_s) \) and \( P(X_t) \) represent the probability distributions of the source and target domains, respectively. KL divergence is asymmetric and measures how much information is lost when approximating one distribution by another. It is especially useful when both distributions have overlapping support and \( P(X_t) > 0 \) whenever \( P(X_s) > 0 \). However, in DA scenarios where the target domain contains regions with zero probability (i.e., where the source distribution has support but the target does not), KL divergence diverges, making it unsuitable for disjoint-support situations.

其中P(Xs)P(Xt)分别表示源域和目标域的概率分布。KL散度是不对称的,衡量用一个分布近似另一个分布时丢失的信息量。当两个分布的支持集重叠且P(Xt)>0时,KL散度尤其有用。然而,在目标域存在零概率区域(即源分布有支持但目标分布无支持)的领域自适应(DA)场景中,KL散度会发散,因此不适用于支持集不相交的情况。

Moreover, KL divergence tends to be more sensitive to outliers compared to the Wasserstein distance and, as will be discussed, the MMD. This sensitivity arises because KL divergence heavily penalizes regions where there is a discrepancy in probability mass, which can lead to overly aggressive adaptations, especially when the target distribution contains sparse or noisy data.

此外,与Wasserstein距离及后文将讨论的MMD相比,KL散度对异常值更为敏感。这种敏感性源于KL散度对概率质量差异区域的强烈惩罚,可能导致过度激进的适应,尤其当目标分布包含稀疏或噪声数据时。
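The zero-probability failure mode is easy to reproduce numerically. The short example below (illustrative only, with hand-made histograms) computes the discrete form of Equation (55) and shows that the divergence becomes infinite as soon as the target assigns zero mass to a bin the source uses, which is why smoothing is often applied in practice.

```python
import numpy as np

def kl_divergence(p, q, eps=0.0):
    """Discrete KL divergence D_KL(p || q); eps > 0 applies simple additive smoothing."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    with np.errstate(divide="ignore"):
        return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

p = [0.5, 0.4, 0.1]            # source histogram
q_overlap = [0.4, 0.4, 0.2]    # overlapping support -> finite divergence
q_disjoint = [0.6, 0.4, 0.0]   # zero target mass where p > 0 -> infinite divergence

print(kl_divergence(p, q_overlap))          # about 0.04
print(kl_divergence(p, q_disjoint))         # inf
print(kl_divergence(p, q_disjoint, 1e-6))   # finite only after smoothing
```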

FIGURE 14. Discrepancy-based DA for object detection in a traffic scene under different conditions: The process starts with a source dataset (representing familiar conditions like clear weather) and a target dataset (representing different conditions like snowy weather). Both datasets are processed through a DNN, which extracts relevant features from each dataset, referred to as “DNN Features.” These DNN features from the source and target datasets are then compared using a discrepancy loss module, which measures and minimizes the differences between the feature sets. This helps the model align the feature representations from both domains, improving its ability to detect objects even in the unfamiliar target domain. By reducing the discrepancy, the model can leverage what it learned from the source data to adapt effectively to the target conditions. The outputs on the right show detected objects in both the source and target datasets, illustrating how the model successfully performs object detection across different environments by minimizing discrepancies in feature representation. This enables more consistent detection results regardless of varying traffic scene conditions.

图14. 基于差异的领域自适应(DA)在不同条件下的交通场景目标检测:过程始于源数据集(代表熟悉条件如晴朗天气)和目标数据集(代表不同条件如雪天)。两个数据集均通过深度神经网络(DNN)处理,提取各自的相关特征,称为“DNN特征”。随后,源和目标数据集的DNN特征通过差异损失模块进行比较,测量并最小化特征集间的差异。这有助于模型对齐两个域的特征表示,提高其在不熟悉目标域中的目标检测能力。通过减少差异,模型能够利用从源数据学到的知识,有效适应目标条件。右侧输出显示了源和目标数据集中的检测目标,展示了模型通过最小化特征表示差异,成功实现跨不同环境的目标检测,从而在多变的交通场景条件下实现更一致的检测结果。

The MMD, introduced by [158], measures the distance between the means of two distributions in a Reproducing Kernel Hilbert Space (RKHS):

MMD,由[158]提出,衡量两个分布在再生核希尔伯特空间(RKHS)中均值的距离:

\[
\mathrm{MMD}\!\left(P(X_s), P(X_t)\right) = \left\lVert \mathbb{E}_{X_s \sim P(X_s)}\!\left[\phi(X_s)\right] - \mathbb{E}_{X_t \sim P(X_t)}\!\left[\phi(X_t)\right] \right\rVert_{\mathcal{H}}, \tag{56}
\]

where \( \phi(X) \) maps the data to a higher-dimensional feature space, \( \mathcal{H} \) is the Hilbert space, and \( \lVert \cdot \rVert_{\mathcal{H}} \) denotes the norm in this space.

其中ϕ(X)将数据映射到高维特征空间,H是希尔伯特空间,H表示该空间中的范数。

The MMD is advantageous in that it does not require explicit density estimation of either distribution, making it computationally efficient and straightforward to implement with kernel methods. Unlike KL divergence, it can handle distributions with disjoint support and is less sensitive to outliers, which provides more stability during training.

MMD的优势在于不需要对任一分布进行显式的密度估计,使其计算高效且易于通过核方法实现。与KL散度不同,MMD能处理支持集不相交的分布,并且对异常值不敏感,训练时更稳定。

MMD's effectiveness depends on the chosen kernel, which affects how accurately it measures source-target discrepancies. A poorly selected kernel can yield suboptimal adaptation if it fails to capture complex distributional relationships. Compared to Wasserstein distance, MMD typically runs faster but is less interpretable in terms of physical distance.

MMD的效果依赖于所选核函数,核函数影响其测量源-目标差异的准确性。若核函数选择不当,可能无法捕捉复杂的分布关系,导致适应效果不佳。相比Wasserstein距离,MMD通常运行更快,但在物理距离的可解释性方面较弱。

Metric selection depends on domain adaptation specifics, such as disjoint support, computational constraints, and noise sensitivity. For high-dimensional generative tasks, Wasserstein distance may offer greater stability, while tasks with overlapping distributions might benefit from the faster convergence of KL divergence or MMD.

度量选择取决于领域自适应的具体情况,如支持集是否不相交、计算限制及噪声敏感性。对于高维生成任务,Wasserstein距离可能提供更高的稳定性,而支持集重叠的任务则可能受益于KL散度或MMD的更快收敛。

The optimization objective, denoted \( L_{\text{align}} \), is a combination of the classification loss on the source domain and the discrepancy loss between the source and target distributions. The trade-off between these two components is controlled by the regularization parameter \( \lambda \):

优化目标,记作Lalign ,是源域分类损失与源域和目标域分布差异损失的组合。这两部分的权衡由正则化参数λ控制:

\[
L_{\text{align}} = L_{\text{source}} + \lambda L_{\text{discrepancy}}. \tag{57}
\]
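
A kernel-based instantiation of this objective can be written compactly. The PyTorch sketch below (an illustration under assumptions, not code from [158] or [161]) computes a biased empirical estimate of the squared MMD with a Gaussian kernel and plugs it into Equation (57); `encoder`, `classifier`, the bandwidth `sigma`, and the weight `lam` are placeholders.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) evaluated for all pairs of rows."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(zs, zt, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two feature batches (Eq. 56)."""
    k_ss = gaussian_kernel(zs, zs, sigma).mean()
    k_tt = gaussian_kernel(zt, zt, sigma).mean()
    k_st = gaussian_kernel(zs, zt, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

def discrepancy_da_loss(encoder, classifier, xs, ys, xt, lam=1.0, sigma=1.0):
    zs, zt = encoder(xs), encoder(xt)
    l_source = F.cross_entropy(classifier(zs), ys)   # Eq. 53
    l_discrepancy = mmd2(zs, zt, sigma)              # squared-MMD form of the discrepancy term
    return l_source + lam * l_discrepancy            # Eq. 57
```

In practice the squared MMD is usually minimized instead of the MMD itself, and the kernel bandwidth `sigma` is itself a hyperparameter (see the HPO discussion later in this section).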

Discrepancy-based DA has significantly advanced several real-world computer vision applications. These applications include autonomous driving systems [159], urban safety monitoring [160], traffic surveillance [161], smart surveillance for Person Re-ID [162], digital recognition in smart city systems [163], and efficient resource management in autonomous systems [164].

基于差异的领域自适应(DA)显著推动了多个现实计算机视觉应用的发展。这些应用包括自动驾驶系统[159]、城市安全监控[160]、交通监控[161]、智能监控中的行人重识别(Person Re-ID)[162]、智慧城市系统中的数字识别[163]以及自主系统中的高效资源管理[164]。

Drawing on the teacher-student paradigm [156], [159] introduces Masked Retraining (MRT) for domain-adaptive object detection. Using a custom masked autoencoder and selective retraining, MRT achieves 51.2% mAP from Cityscapes to Foggy Cityscapes, improving adaptability and accuracy by capturing target domain traits and handling incorrect pseudo labels.

借鉴师生范式[156],[159]提出了用于领域自适应目标检测的掩码重训练(Masked Retraining,MRT)。通过定制的掩码自编码器和选择性重训练,MRT实现了从Cityscapes到Foggy Cityscapes的51.2mAP,通过捕捉目标域特征并处理错误伪标签,提高了适应性和准确性。

Building on DETR [94], [160] proposes a robust baseline for DETR-style detectors under domain shift. Incorporating Object-Aware Alignment (OAA) and Optimal Transport Alignment (OTA), it mitigates shifts in both backbone and decoder outputs, raising mAP to 46.8% for Cityscapes to Foggy Cityscapes adaptation.

基于DETR[94],[160]提出了在领域偏移下适用于DETR风格检测器的鲁棒基线。通过引入面向对象的对齐(Object-Aware Alignment,OAA)和最优传输对齐(Optimal Transport Alignment,OTA),缓解了主干网络和解码器输出的偏移,使Cityscapes到Foggy Cityscapes的mAP提升至46.8%。

ML-ANet (Multi-Label Adaptation Network) [161] reduces source-target domain discrepancy using multiple kernel variants with MMD. Task-specific hidden layers are embedded in an RKHS (Reproducing Kernel Hilbert Space) to align feature distributions across domains, resulting in improved efficiency. ML-ANet achieves a mean accuracy of 94.83% on the Cityscapes to Foggy Cityscapes benchmark.

ML-ANet(多标签适应网络)[161]利用多种核变体和最大均值差异(MMD)减少源域与目标域的差异。任务特定的隐藏层嵌入在再生核希尔伯特空间(RKHS)中,实现跨域特征分布的对齐,从而提升效率。ML-ANet在Cityscapes到Foggy Cityscapes基准上达到94.83%的平均准确率。

D-MMD (Dissimilarity-based MMD) loss [162] addresses the challenges of UDA in Person Re-ID by aligning pairwise dissimilarities between source and target domains rather than feature representations. This approach achieves an mAP of 48.8% on the DukeMTMC to Market1501 benchmark, without requiring data augmentation or complex network designs.

D-MMD(基于不相似性的最大均值差异)损失[162]通过对齐源域和目标域之间的成对不相似性而非特征表示,解决了行人重识别(Person Re-ID)中无监督领域自适应(UDA)的挑战。该方法在DukeMTMC到Market1501基准上实现了48.8%的mAP,无需数据增强或复杂网络设计。

In [163], the sliced Wasserstein discrepancy (SWD) is introduced for UDA, combining task-specific decision boundary alignment with the Wasserstein distance. Validations include digit/sign recognition (98.6±0.3% from SYNSIG to GTSRB), image classification (76.4% mean accuracy on VisDA 2017), semantic segmentation (44.5% mIoU from GTA5 to Cityscapes), and object detection (5.9% mAP on VisDA 2018).

[163]中引入了切片Wasserstein差异(Sliced Wasserstein Discrepancy,SWD)用于无监督领域自适应,结合了任务特定的决策边界对齐与Wasserstein距离。验证任务包括SYNSIG和GTSRB上的数字/标志识别(98.6±0.3%,VisDA 2017上的图像分类(平均准确率76.4%),GTA5到Cityscapes的语义分割(44.5% mIoU),以及VisDA 2018上的目标检测(5.9% mAP)。

Selective adaptation for object detection (TDOD), leveraging domain gap metrics such as MMD, DSS, and SWD, is proposed in [164] to perform adaptation only when necessary. This approach minimizes costs while maintaining accuracy. On the DGTA benchmark, a no-adaptation model achieves 90.3% AP50 (clear daytime to overcast), whereas selective adaptation improves performance to 93.1%, delivering significant energy savings.

[164]提出了选择性适应的目标检测(TDOD),利用领域差距度量如MMD、DSS和SWD,仅在必要时执行适应。该方法在降低成本的同时保持准确性。在DGTA基准上,无适应模型在晴朗白天到阴天的场景中达到90.3% AP50,而选择性适应将性能提升至93.1%,显著节省了能耗。

Discrepancy-based DA aligns features across domains using metrics like MMD and Wasserstein distance, handling variations in layout, color, and environment. It bolsters robustness in Re-ID, tracking, and action recognition, aiding cross-camera tracking, behavior analysis, and intelligent traffic systems. However, these methods may struggle with complex shifts not fully captured by distance metrics and can be computationally intensive, requiring extensive tuning and resources. This complexity may limit their scalability in large, real-time systems.

基于差异的领域自适应通过MMD和Wasserstein距离等度量对齐跨域特征,处理布局、颜色和环境的变化。它增强了行人重识别、跟踪和动作识别的鲁棒性,支持跨摄像头跟踪、行为分析和智能交通系统。然而,这些方法可能难以应对距离度量无法完全捕捉的复杂偏移,且计算开销较大,需要大量调优和资源,限制了其在大规模实时系统中的可扩展性。

C. ADVERSARIAL-BASED DOMAIN ADAPTATION

C. 基于对抗的领域自适应

Adversarial-based DA is a technique for adapting a model trained on one domain, called the source domain, to achieve strong performance on a different domain, known as the target domain. This approach employs adversarial learning to reduce discrepancies between the two domains, allowing the model to generalize effectively. The core idea is to train a feature extractor that makes the data representations from both the source and target domains indistinguishable to a domain discriminator, while also ensuring the model performs well on the original source domain task.

基于对抗的领域自适应是一种将模型从源域迁移到目标域以实现良好性能的技术。该方法利用对抗学习减少两个域之间的差异,使模型能够有效泛化。核心思想是训练一个特征提取器,使源域和目标域的数据表示对域判别器不可区分,同时确保模型在源域任务上的表现良好。

Figure 15 illustrates the use of adversarial-based domain adaptation for segmenting a traffic scene. Let \( D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s} \) be the labeled source domain dataset, where \( x_i^s \) is the \( i \)-th feature vector in the source domain and \( y_i^s \) is its corresponding label. Let \( D_t = \{x_j^t\}_{j=1}^{n_t} \) be the unlabeled target domain dataset, where \( x_j^t \) is the \( j \)-th feature vector in the target domain. The goal is to train a model that performs well on the target domain despite the difference in data distributions between the source and target domains.

图15展示了基于对抗的领域自适应在交通场景分割中的应用。设Ds=为带标签的源域数据集,其中xis是源域中的第i个特征向量,yis是其对应的标签。设Dt={xjt}j=1nt为无标签的目标域数据集,其中xjt是目标域中的第j个特征向量。目标是训练一个模型,使其在源域和目标域数据分布存在差异的情况下,仍能在目标域上表现良好。

FIGURE 15. Application of Adversarial-based DA to segmentation of a traffic scene: The figure shows a labeled source domain dataset and an unlabeled target domain dataset, where the source and target domains are captured under different weather conditions. The goal of this method is to achieve effective segmentation in the target domain, despite differences in data distribution between the source and target domains and variations in weather. The adversarial-based DA setup involves a shared-weight feature extractor and a domain adversarial training mechanism to align the feature spaces of both domains. The feature extractor is designed to map both source and target domain data into a shared feature space, minimizing domain-specific distinctions, including those caused by different weather conditions. The classifier predicts segmentation labels for the source domain, while the domain discriminator attempts to distinguish between source and target domain features. During training, the feature extractor is optimized to fool the domain discriminator, leading to more domain-invariant feature representations. Segmentation accuracy is further enhanced by utilizing insights from the source domain's class size distribution, which helps to regulate the constrained mutual information loss in the target domain. The combination of classification and adversarial feature losses are optimized to ensure that the segmentation model generalizes well to the target domain. Ultimately, this process results in accurate segmentation of the traffic scene, regardless of weather conditions.

图15. 基于对抗的领域自适应在交通场景分割中的应用:图中展示了一个带标签的源域数据集和一个无标签的目标域数据集,源域和目标域数据分别采集于不同的天气条件下。该方法的目标是在源域和目标域数据分布及天气条件存在差异的情况下,实现目标域的有效分割。基于对抗的领域自适应设置包括一个共享权重的特征提取器和一个领域对抗训练机制,用以对齐两个域的特征空间。特征提取器旨在将源域和目标域数据映射到共享特征空间,最小化域特异性差异,包括由不同天气条件引起的差异。分类器预测源域的分割标签,而领域判别器试图区分源域和目标域的特征。在训练过程中,特征提取器被优化以欺骗领域判别器,从而获得更具域不变性的特征表示。通过利用源域类别大小分布的信息,进一步提升分割准确率,这有助于调节目标域中的受限互信息损失。分类损失与对抗特征损失的结合被优化,以确保分割模型在目标域上的良好泛化。最终,该过程实现了对交通场景的准确分割,无论天气条件如何变化。

Adversarial-based DA typically involves a feature extractor \( G_f \), a classifier \( G_y \), and a domain discriminator \( G_d \). The feature extractor maps source and target data into a shared space, while the classifier predicts labels for source data. The domain discriminator tries to distinguish source from target features. During training, \( G_f \) learns to fool \( G_d \), making source and target features indistinguishable. Variations include multiple discriminators, reconstruction losses, gradient reversal layers, and other alignment techniques. Some methods add extra objectives for specific feature properties or multi-level alignment, but the core aim is to extract domain-invariant features that generalize effectively to the target domain.

基于对抗的领域自适应通常包括一个特征提取器(Gf)、一个分类器(Gy)和一个领域判别器(Gd)。特征提取器将源域和目标域数据映射到共享空间,分类器预测源域数据的标签。领域判别器试图区分源域和目标域的特征。在训练过程中,Gf学习欺骗Gd,使源域和目标域特征难以区分。变体包括多个判别器、重构损失、梯度反转层及其他对齐技术。一些方法增加了针对特定特征属性或多层次对齐的额外目标,但核心目标是提取能够有效泛化到目标域的域不变特征。

The objective for the classifier and feature extractor on the source domain is to minimize the classification loss \( L_y \), typically cross-entropy:

分类器和特征提取器在源域上的目标是最小化分类损失Ly,通常为交叉熵损失:

\[
L_y = \frac{1}{n_s} \sum_{i=1}^{n_s} \ell\!\left(G_y\!\left(G_f(x_i^s)\right), y_i^s\right), \tag{58}
\]

where \( G_f(x_i^s) \) is the feature representation of the source data, \( G_y(G_f(x_i^s)) \) is the predicted label, and \( \ell \) is the cross-entropy loss.

其中Gf(xis)是源数据的特征表示,Gy(Gf(xis))是预测标签,是交叉熵损失。

The domain discriminator \( G_d \) is trained to distinguish between source and target features by minimizing the binary classification loss:

领域判别器Gd通过最小化二分类损失来区分源域和目标域特征:

\[
L_d = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log\!\left(G_d\!\left(G_f(x_i^s)\right)\right) - \frac{1}{n_t} \sum_{j=1}^{n_t} \log\!\left(1 - G_d\!\left(G_f(x_j^t)\right)\right), \tag{59}
\]

where \( G_d(G_f(x_i^s)) \) is the probability that a source feature is classified as belonging to the source domain, and \( 1 - G_d(G_f(x_j^t)) \) is the probability that a target feature is classified as belonging to the target domain.

其中Gd(Gf(xis))是源特征被判定为源域的概率,Gd(Gf(xjt))是目标特征被判定为目标域的概率。

To ensure \( G_f \) produces domain-invariant features, it is trained to fool \( G_d \) by maximizing the domain discriminator's loss, making source and target features similar. Formally, \( L_f = -L_d \).

为了确保Gf生成域不变特征,它通过最大化领域判别器的损失来欺骗Gd,使源域和目标域特征相似。形式化地,Lf=Ld

The overall loss combines the source classification loss and the adversarial domain confusion loss:

整体损失结合了源域分类损失和对抗领域混淆损失:

\[
L = L_y + \lambda L_f,
\]

where λ balances domain confusion against classification accuracy.

其中λ平衡领域混淆与分类准确率。

Training alternates between two steps: training \( G_d \) to distinguish source from target by minimizing \( L_d \), and training \( G_f \) to minimize \( L_y \) and \( L_f \). This adversarial process aligns source and target features, improving generalization to the target domain.

训练在两个步骤之间交替进行:训练Gd通过最小化Ld来区分源域和目标域,训练Gf以最小化LyLf。这种对抗过程使源域和目标域特征对齐,提升了对目标域的泛化能力。
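A condensed PyTorch sketch of this alternating scheme follows (illustrative only; the networks `Gf`, `Gy`, `Gd`, the two optimizers, and the weight `lam` are assumed, and `Gd` is taken to output a probability via a final sigmoid). It implements Equations (58) and (59) and uses the common label-flipping trick as a stand-in for \( L_f = -L_d \).

```python
import torch
import torch.nn.functional as F

def adversarial_da_step(Gf, Gy, Gd, opt_fy, opt_d, xs, ys, xt, lam=0.1):
    """One alternating update of adversarial DA (Eq. 58-59)."""
    # Step 1: update the domain discriminator Gd to minimize Ld (Eq. 59).
    with torch.no_grad():                        # keep the feature extractor fixed here
        fs, ft = Gf(xs), Gf(xt)
    d_s, d_t = Gd(fs), Gd(ft)                    # predicted probability of "source"
    ld = F.binary_cross_entropy(d_s, torch.ones_like(d_s)) + \
         F.binary_cross_entropy(d_t, torch.zeros_like(d_t))
    opt_d.zero_grad(); ld.backward(); opt_d.step()

    # Step 2: update Gf and Gy; classify source data (Eq. 58) and fool Gd by
    # asking it to label target features as "source" (a non-saturating stand-in
    # for maximizing Ld, i.e. Lf = -Ld).
    fs, ft = Gf(xs), Gf(xt)
    ly = F.cross_entropy(Gy(fs), ys)
    d_ft = Gd(ft)
    lf = F.binary_cross_entropy(d_ft, torch.ones_like(d_ft))
    loss = ly + lam * lf
    opt_fy.zero_grad(); loss.backward(); opt_fy.step()
    return ly.item(), ld.item()
```

Gradient-reversal-layer implementations fold the two steps into a single backward pass, but the alternating form above is the easiest to map onto the losses written out in this subsection.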

Image-to-Image Translation (I2IT), introduced in [165], converts an image from one domain to another, preserving core structure while adapting style or characteristics. It is used in photo enhancement, style transfer, and data augmentation. Though not inherently a DA task, I2IT methods are widely applied for DA ([166], [167]), often using GAN-based frameworks [124]. CycleGAN [137] is a key milestone in I2IT and underpins many of the adversarial DA approaches in Table 4.

I2IT(图像到图像转换),如文献[165]所述,将图像从一个域转换到另一个域,保持核心结构不变,同时调整风格或特征。它被用于照片增强、风格迁移和数据增强。尽管本质上不是域适应(DA)任务,I2IT方法被广泛应用于DA([166],[167]),通常采用基于生成对抗网络(GAN)的框架[124]。CycleGAN[137]是I2IT领域的重要里程碑,支撑了表4中许多对抗域适应方法。

CycleGAN introduces a cycle-consistency loss to ensure that the original image can be recovered after a round-trip translation \( (X \rightarrow Y \rightarrow X) \). This loss is defined as:

CycleGAN引入了循环一致性损失,以确保经过往返转换后能够恢复原始图像(XYX)。其定义如下:

\[
L_{\text{cycle}}(G, G') = \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\lVert G'(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\!\left[\lVert G(G'(y)) - y \rVert_1\right], \tag{60}
\]

where \( G : X \rightarrow Y \) and \( G' : Y \rightarrow X \) are the forward and backward generators, respectively, and \( \lVert \cdot \rVert_1 \) denotes the L1 norm. The total loss for training CycleGAN is the sum of this cycle-consistency loss and the adversarial loss as in Equation 59.

其中G:XYG:YX分别为正向和反向生成器,1表示L1范数。CycleGAN的总训练损失是该循环一致性损失与对抗损失之和,如公式59所示。

Adversarial-based DA has been applied in traffic scene understanding for tasks like day-to-night translation [168], haze synthesis and removal [148], semantic segmentation [169], object detection [170], [171], [172], [173], scene classification [174], [175], scene segmentation [176], and cross-domain adaptation in challenging environments [177], [178], [179], [180], [181], [182]. Moreover, these methods contribute to fair scene adaptation in urban monitoring [183].

基于对抗的域适应已应用于交通场景理解中的多项任务,如昼夜转换[168]、雾霾合成与去除[148]、语义分割[169]、目标检测[170],[171],[172],[173]、场景分类[174],[175]、场景分割[176]以及复杂环境下的跨域适应[177],[178],[179],[180],[181],[182]。此外,这些方法还促进了城市监控中的公平场景适应[183]。

The Fréchet Inception Distance (FID) score measures similarity between real and generated images by calculating the Fréchet (Wasserstein-2) distance between their multivariate Gaussian distributions, based on means and covariances. It uses features from an intermediate layer of the Inception network [184], capturing both visual quality and diversity.

Fréchet Inception Distance(FID)分数通过计算真实图像和生成图像的多元高斯分布之间的Fréchet(Wasserstein-2)距离来衡量它们的相似性,该距离基于均值和协方差。它利用Inception网络[184]中间层的特征,综合反映视觉质量和多样性。
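Given the Inception features of the two image sets, the FID reduces to a closed-form distance between fitted Gaussians. The NumPy/SciPy sketch below is a generic implementation of that formula (feature extraction with the Inception network is omitted and assumed to be done elsewhere).

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two feature matrices (rows = images, columns = Inception feature dims)."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)

    # Matrix square root of the covariance product; tiny imaginary parts caused by
    # numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```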

The adversarial method in [168] (referred to as "Day-to-Night" in our work) uses CycleGAN [137] for day-to-night translation, enhanced by transfer learning from semantic segmentation. Trained on BDD segmentation data and adapted to Tokyo, it handles unique lighting conditions, improving object detection (55.3% to 57.2% mAP) and segmentation (59.5% to 61.6% mIoU). Outperforming SemGAN [185], AugGAN [186], and CycleGAN, it achieves an FID score of 39.26 and generates realistic night images for Tokyo scenes.

文献[168]中的对抗方法(在本文中称为“昼夜转换”)使用CycleGAN[137]进行昼夜图像转换,并通过语义分割的迁移学习进行增强。该方法在BDD分割数据上训练并适配东京场景,处理独特的光照条件,提升了目标检测(55.3%57.2%mAP)和分割性能(mIoU从59.5%提升至61.6%)。其表现优于SemGAN[185]、AugGAN[186]和CycleGAN,FID分数达到39.26,生成了逼真的东京夜景图像。

ParaTeacher [179] introduces a UDA approach combining a Style-Content Discriminated Data Recombination (SCD-DR) module for data refinement and an Iterative Cross-Domain Knowledge Transferring (ICD-KT) module for knowledge enhancement. Integrated with Faster R-CNN, it boosts mAP by 5% to 10%, achieving 44.59% on the Virtual KITTI to KITTI benchmark, significantly narrowing the synthetic-to-real data gap.

ParaTeacher[179]提出了一种结合风格-内容区分数据重组(SCD-DR)模块进行数据精炼和迭代跨域知识转移(ICD-KT)模块进行知识增强的无监督域适应(UDA)方法。该方法集成于Faster R-CNN中,提升了mAP5%10%,在Virtual KITTI到KITTI基准测试中达到44.59%,显著缩小了合成数据与真实数据之间的差距。

PanopticGAN [180] proposes a GAN framework for panoptic-aware I2IT, improving image quality and object recognition with a feature masking module and a compact thermal dataset. It enhances boundary sharpness and segmentation, achieving superior fidelity and an FID score of 69.4, outperforming existing methods.

PanopticGAN[180]提出了一种面向全景感知的I2IT生成对抗网络框架,通过特征掩码模块和紧凑的热成像数据集提升图像质量和目标识别能力。该方法增强了边界清晰度和分割效果,实现了优异的保真度,FID分数为69.4,优于现有方法。

The CyCADA model [181] combines discriminative training with cycle-consistent adversarial DA at the pixel and feature levels, requiring no aligned image pairs. It proves effective in semantic segmentation, achieving 39.5% mIoU in GTA5 to CityScapes adaptation.

CyCADA模型[181]结合了判别训练和像素级及特征级的循环一致对抗域适应,无需对齐的图像对。该模型在语义分割任务中表现出色,实现了GTA5到CityScapes适配中的39.5mLoU

The model in [174] (referred to as "UDAofUrbanScenes" in our work) combines supervised learning on synthetic data, adversarial learning between labeled synthetic and unlabeled real data, and self-teaching guided by segmentation confidence. It adapts urban scene segmentation from synthetic datasets (GTA5, SYNTHIA) to real-world datasets (Cityscapes), improving performance on rare classes and achieving 30.2% mIoU.

文献[174]中的模型(本文称为“城市场景UDA”)结合了合成数据的监督学习、带标签合成数据与无标签真实数据之间的对抗学习,以及由分割置信度引导的自我教学。该方法实现了从合成数据集(GTA5,SYNTHIA)到真实数据集(Cityscapes)的城市场景分割适应,提升了稀有类别的性能,mIoU达到30.2%。

CPGAN [172] enhances vehicle detection in foggy conditions with an improved CycleGAN [137] for style transfer and a pre-trained YOLOv4, achieving 69.24% mAP50 on HVFD (normal to foggy). Its adversarial setup, with two generators and discriminators plus a perceptual consistency loss, enables effective pixel- and feature-level adaptation, improving detection in foggy environments.

CPGAN [172] 通过改进的CycleGAN [137]进行风格迁移和预训练的YOLOv4,提升了雾天条件下的车辆检测,在HVFD(正常到雾天)上实现了69.24%mAP50。其对抗结构包含两个生成器和判别器以及感知一致性损失,实现了有效的像素级和特征级适应,提升了雾天环境下的检测性能。

The I2IT model [182] (referred to as "SGND" in our work) introduces a multi-task unsupervised NN for day-to-night translation using adversarial training. It combines semantic segmentation and geometric depth as spatial attention maps on the BDD dataset. Featuring a generator for conversion and a discriminator for realism, SGND achieves an FID score of 31.245, superior KID metrics, and improved realism, accuracy, and domain mapping.

I2IT模型[182](在本工作中称为“SGND”)引入了一种多任务无监督神经网络,利用对抗训练实现昼夜转换。该模型结合了语义分割和几何深度作为BDD数据集上的空间注意力图。SGND包含一个转换生成器和一个真实性判别器,取得了31.245的FID分数、更优的KID指标,以及更高的真实感、准确性和域映射效果。

FIGURE 16. Training procedure of adversarial feature adaptation for traffic scene classification. The process involves three key steps: In Step 0, the source-specific feature extractor (E5) and classifier (C) are trained with source images to minimize classification errors via cross-entropy (CE) loss. In Step 1, the feature generator (S) and discriminator (D₁) are trained to produce domain-specific generated features, using noise conditioning and source labels to enhance the adaptation capability. In Step 2, the shared encoder (E,I) and discriminator (D 2 ) are trained to align the source and target domains through adversarial loss (GAN loss), ensuring domain invariance. The dashed lines represent auxiliary feature generation pathways, while solid lines depict the main training process. By leveraging both real and GAN-generated data, this procedure enhances the model’s robustness for classification under diverse and challenging conditions.

图16. 用于交通场景分类的对抗特征适应训练流程。该过程包含三个关键步骤:步骤0中,源域特定特征提取器(E5)和分类器(C)通过交叉熵(CE)损失在源图像上训练以最小化分类误差。步骤1中,特征生成器(S)和判别器(D₁)训练生成域特定特征,利用噪声条件和源标签增强适应能力。步骤2中,共享编码器(E,I)和判别器(D 2)通过对抗损失(GAN损失)对齐源域和目标域,确保域不变性。虚线表示辅助特征生成路径,实线表示主要训练流程。通过结合真实数据和GAN生成数据,该流程增强了模型在多样且复杂条件下的分类鲁棒性。

An innovative unsupervised I2IT framework is introduced in [148] that leverages both VAE and GAN, along with an MMD-based VAE, which utilizes MMD as a discrepancy measure to align latent distributions effectively. This discrepancy-based alignment allows the framework to handle both haze image synthesis and haze removal in a unified manner, demonstrating promising results on the Apollo dataset, with PSNR and SSIM metrics of 27.3772 and 0.9271, respectively.

文献[148]提出了一种创新的无监督I2IT框架,结合了变分自编码器(VAE)和生成对抗网络(GAN),以及基于最大均值差异(MMD)的VAE,利用MMD作为差异度量有效对齐潜在分布。基于差异的对齐使该框架能够统一处理雾图像合成和去雾任务,在Apollo数据集上表现出良好效果,PSNR和SSIM指标分别为27.3772和0.9271。

In [169], a novel teacher-student [156] approach for unsupervised domain-adaptive semantic segmentation in memory-constrained models (referred to as DRN-D-BasedDA in our work) is presented. The method employs a multi-level strategy with adversarial learning and uses a custom cross-entropy loss with pseudo-teacher labels to address domain gaps and memory constraints. DRN-D-BasedDA improves adaptability in both real and synthetic scenarios,achieving an mIoU of 37.35% from GTA5 to Cityscapes.

文献[169]提出了一种新颖的教师-学生[156]方法,用于内存受限模型中的无监督域自适应语义分割(本工作称为DRN-D-BasedDA)。该方法采用多层次策略结合对抗学习,使用带伪教师标签的自定义交叉熵损失以解决域差异和内存限制问题。DRN-D-BasedDA提升了真实和合成场景下的适应性,实现了从GTA5到Cityscapes的mIoU为37.35%

Adversarial Feature Adaptation (AFA), as first introduced in [187], is a technique for UDA that enhances model robustness and generalization by augmenting training data with adversarially generated features. It employs a domain-invariant feature extractor trained via feature space data augmentation, utilizing GANs to broaden the input feature distribution. This method aims to improve the model's generalization to unseen data, especially useful in scenarios requiring resilience to adversarial examples or when training data is scarce or non-representative. Like with I2IT adversarial models, it has been applied to overcome domain shift problems for various traffic vision tasks including object detection [170], [171], traffic scene classification [175], and traffic scene segmentation [176]. Figure 16 shows the training procedure of an AFA to be applied to a traffic scene classification problem.

对抗特征适应(AFA)首次在文献[187]中提出,是一种用于无监督域自适应(UDA)的技术,通过对抗生成的特征增强训练数据,提升模型的鲁棒性和泛化能力。该方法利用域不变特征提取器,通过特征空间数据增强和GAN扩展输入特征分布,旨在提升模型对未见数据的泛化能力,尤其适用于需要抵抗对抗样本或训练数据稀缺/不具代表性的场景。与I2IT对抗模型类似,AFA已应用于解决多种交通视觉任务的域偏移问题,包括目标检测[170],[171]、交通场景分类[175]和交通场景分割[176]。图16展示了应用于交通场景分类问题的AFA训练流程。

AFAN [170] merges an advanced UDA framework for object detection with an intermediate domain image generator and domain-adversarial training with soft domain labels, significantly enhancing feature alignment through feature pyramid and region feature alignment techniques. This comprehensive approach fosters domain-invariant feature learning and achieves a notable mAP of 41.4% in object detection adapted from the CityScapes to the KITTI benchmark.

AFAN [170]融合了先进的无监督域自适应框架,用于目标检测,结合中间域图像生成器和带软域标签的域对抗训练,通过特征金字塔和区域特征对齐技术显著增强了特征对齐能力。该综合方法促进了域不变特征学习,在从CityScapes到KITTI基准的目标检测中实现了显著的mAP为41.4%

SADA (Sparse Adversarial Domain Adaptation) [175] tackles weather-related domain shift in traffic scene classification. With 93.20% accuracy (Sunny to Cloudy) on the HSD dataset, it employs a unique sparse adversarial deep NN. This model captures sparse data from source scenes and aligns them with target images, extracting domain-invariant features for accurate classification. SADA outperforms existing methods, showcasing the power of sparse data and adversarial domain alignment in deep networks.

SADA(稀疏对抗域自适应)[175]解决了交通场景分类中的天气相关域偏移问题。在HSD数据集上(晴天到多云)实现了93.20%的准确率,采用独特的稀疏对抗深度神经网络。该模型捕捉源场景的稀疏数据并与目标图像对齐,提取域不变特征以实现准确分类。SADA优于现有方法,展示了稀疏数据和对抗域对齐在深度网络中的强大效能。

The study in [176] introduces an innovative UDA model employing a sparse adversarial multi-target approach to address domain shifts in real-world traffic scenes. Achieving a segmentation accuracy of 76.13 IoU on the ACDC dataset, it outperforms state-of-the-art methods, demonstrating the effectiveness of sparse representation compared to deep dense alternatives under diverse environmental conditions.

文献[176]提出了一种创新的无监督域自适应模型,采用稀疏对抗多目标方法应对真实交通场景中的域偏移。在ACDC数据集上实现了76.13的IoU分割准确率,优于最先进方法,证明了稀疏表示相较于深度密集方法在多样环境条件下的有效性。

The approach proposed in [171] for handling foggy and rainy conditions combines image and object-level adaptations with an adversarial gradient reversal layer to mine challenging examples. Additionally, it employs an auxiliary domain via data augmentation to introduce new domain-level metric regularization. This method achieves a detection mAP of 45.0% when transferring from CityScapes to Rainy CityScapes, and 42.3% from CityScapes to Foggy CityScapes.

[171]中提出的处理雾天和雨天条件的方法结合了图像级和目标级的适应,并通过对抗性梯度反转层挖掘具有挑战性的样本。此外,该方法通过数据增强引入辅助域,实现新的域级度量正则化。该方法在从CityScapes迁移到Rainy CityScapes时达到45.0%的检测mAP,在从CityScapes迁移到Foggy CityScapes时达到42.3%。

FREDOM [183] addresses fairness in DA for semantic scene understanding, leveraging transformer networks [188] to model conditional structures and balance class distributions. By utilizing a self-supervised loss with pseudo labels and introducing a conditional structural constraint, it achieves mIoU accuracies of 67.0% for SYNTHIA to Cityscapes and 73.6% for GTA5 to Cityscapes, emphasizing equitable performance across classes.

FREDOM [183]针对语义场景理解中的域适应公平性问题,利用变换器网络(transformer networks)[188]建模条件结构并平衡类别分布。通过使用带伪标签的自监督损失并引入条件结构约束,实现了SYNTHIA Cityscapes的mIoU准确率为67.0%,GTA5 Cityscapes为73.6%,强调了各类别间的公平性能表现。

The Self-Adversarial Disentangling (SAD) framework, proposed in [178], addresses the challenge of adapting to specific domain shifts in DA by introducing the concept of Specific DA (SDA) and mitigating intra-domain gaps through a domainness creator and a self-adversarial regularizer, achieving 45.2% mAP on the Cityscapes to Foggy Cityscapes benchmark.

[178]提出的自对抗解耦(Self-Adversarial Disentangling,SAD)框架,通过引入特定域适应(Specific DA,SDA)概念,并通过域特性生成器和自对抗正则器缓解域内差距,解决了域适应中特定域偏移的挑战,在Cityscapes到Foggy Cityscapes的基准测试中取得了45.2%mAP的成绩。

The authors of [177] proposed Category-induced Coarse-to-Fine DA (C2FDA) to address the challenges of adapting object detection models to unseen and complex traffic environments. They introduced three key components: Attention-induced Coarse-Grained Alignment (ACGA), Attention-induced Feature Selection, and Category-induced Fine-Grained Alignment (CFGA). Their approach achieved 48.9% AP on synthetic-to-real adaptation (SIM10K to Cityscapes), 40.5% mAP on weather adaptation (Cityscapes to Foggy Cityscapes), and 48.0% AP on cross-camera adaptation (KITTI to Cityscapes).

[177]的作者提出了类别引导的粗到细域适应(Category-induced Coarse-to-Fine DA,C2FDA),以应对目标检测模型在未知且复杂交通环境中的适应难题。该方法引入了三个关键组件:注意力引导的粗粒度对齐(ACGA)、注意力引导的特征选择和类别引导的细粒度对齐(CFGA)。其方法在合成到真实的适应(SIM10K到Cityscapes)中达到48.9%的AP,在天气适应(Cityscapes到Foggy Cityscapes)中达到40.5%mAP,在跨摄像头适应(KITTI到Cityscapes)中达到48.0%的AP。

The DAAF (Domain Adaptation of Anchor-Free) object detection method [173] tackles the challenges of cross-domain object detection in complex urban traffic scenarios. It utilizes fully convolutional adversarial training for global feature alignment and incorporates Pixel-Level Adaptation (PLA) for local feature alignment. This approach achieves an AP50 of 53.4% when transferring from SIM10K to Cityscapes, and 37.87% from SIM10K to KITTI.

DAAF(无锚域适应)目标检测方法[173],解决了复杂城市交通场景中的跨域目标检测挑战。该方法利用全卷积对抗训练实现全局特征对齐,并结合像素级适应(Pixel-Level Adaptation,PLA)实现局部特征对齐。在从SIM 10K迁移到Cityscapes时,该方法实现了AP50为53.4%,从SIM 10K到KITTI时达到37.87%。

Adversarial-based DA enhances traffic tasks like reidentification, tracking, and action recognition by using adversarial training to create domain-invariant features, minimizing domain-specific biases and enabling robust performance across varying conditions. This supports applications such as traffic flow optimization, anomaly detection, and cross-camera tracking, vital for autonomous driving and intelligent traffic systems. However, in traffic scene understanding, adversarial-based DA requires careful tuning to balance adversarial loss with task-specific accuracy, as misalignment can lead to incorrect identification or tracking. It may also struggle in rapidly changing traffic environments, where maintaining consistent feature alignment is challenging, impacting the reliability of tracking and action recognition, especially in dense, dynamic traffic scenarios.

基于对抗的域适应通过对抗训练生成域不变特征,减少域特定偏差,提升了交通任务如重识别、跟踪和动作识别的性能,实现了跨条件的鲁棒表现。这支持了交通流优化、异常检测和跨摄像头跟踪等应用,对于自动驾驶和智能交通系统至关重要。然而,在交通场景理解中,基于对抗的域适应需要谨慎调节对抗损失与任务特定准确率的平衡,因不当对齐可能导致错误识别或跟踪。此外,在快速变化的交通环境中,维持一致的特征对齐较为困难,影响跟踪和动作识别的可靠性,尤其是在密集且动态的交通场景中。

D. HPO FOR DOMAIN ADAPTATION MODELS

域适应模型的超参数优化(D.HPO)

HPO enhances the performance of clustering-based, discrepancy-based, and adversarial DA models, especially for tasks like Person Re-ID, object detection, and semantic segmentation. By fine-tuning hyperparameters such as learning rates, loss coefficients, and architectural choices, HPO optimizes model components for effectiveness across domains.

超参数优化(HPO)提升了基于聚类、基于差异和基于对抗的域适应模型的性能,尤其适用于行人重识别、目标检测和语义分割等任务。通过微调学习率、损失系数和架构选择等超参数,HPO优化了模型组件以实现跨域的有效性。
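The searches cited below differ in which hyperparameters they expose, but the overall loop is the same. The sketch below shows a deliberately simple random search over the kinds of knobs discussed in this subsection; `train_and_evaluate` is a hypothetical callback (not a real API) that trains one of the DA models above with a given configuration and returns a validation metric such as mAP or mIoU.

```python
import random

def random_search_hpo(train_and_evaluate, n_trials=20, seed=0):
    """Plain random search over common DA hyperparameters; returns the best configuration."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, -3),             # log-uniform learning rate
            "lambda_align": rng.uniform(0.1, 1.0),        # weight of the alignment / discrepancy loss
            "pseudo_label_conf": rng.uniform(0.7, 0.9),   # pseudo-label acceptance threshold
        }
        score = train_and_evaluate(config)                # hypothetical training/evaluation call
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

Bayesian or population-based optimizers are typically substituted for the random sampler in practice, but the interface stays the same: propose a configuration, train, score, and keep the best.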

In clustering-based DA methods, optimizing network architectures, loss weights, learning rates, and data augmentation strategies improves pseudo-label reliability and domain alignment. For instance, in CDCL [149], HPO balances learning rates, weight decay, and the contrastive loss temperature, a hyperparameter that scales the similarity measure in the contrastive loss and thereby fine-tunes feature differentiation. This results in improved feature extraction and an mAP of 81.5% on DukeMTMC-ReID to Market1501. DMD [150] benefits from tuning graph learning rates, distillation weights, which control the influence of knowledge transfer from a teacher model to a student model, and batch size, achieving an mAP of 92.7%. In object detection, ConfMix [151] adjusts parameters such as the pseudo-label confidence threshold (0.7 to 0.9), which sets the bar for pseudo-label acceptance to enhance reliability, and NMS thresholds (0.3 to 0.5), achieving an mAP of 52.2% when adapting from KITTI to Cityscapes. FFREEDA [152] tunes learning rates for cluster-specific classifiers (0.0005 to 0.0015) and loss weights, achieving an mIoU of 40.16±1.02% from GTA5 to Mapillary.

在基于聚类的领域自适应(DA)方法中,优化网络架构、损失权重、学习率和数据增强策略能够提升伪标签的可靠性和领域对齐效果。例如,在CDCL [149]中,超参数优化(HPO)平衡了学习率、权重衰减和对比损失温度——一个用于缩放对比损失中相似度度量的超参数,微调特征区分能力。这提升了特征提取效果,并在DukeMTMC-ReID到Market1501的迁移中实现了81.5%的mAP。DMD [150]通过调节图学习率、蒸馏权重(控制教师模型向学生模型知识转移的影响)和批量大小,达到了92.7%的mAP。在目标检测中,ConfMix [151]调整了对抗参数,如伪标签置信度(0.7到0.9),该阈值用于伪标签接受以增强可靠性,以及非极大值抑制(NMS)阈值(0.3-0.5),实现了从KITTI到Cityscapes的mAP为52.2%。FFREEDA [152]调节了针对特定聚类分类器的学习率(0.0005到0.0015)和损失权重,实现了从GTA5到Mapillary的mIoU为40.16±1.02%

In discrepancy-based DA models, HPO optimizes domain alignment metrics such as FID and MMD. In ParaTeacher [179], HPO fine-tunes modules by adjusting alignment coefficients (0.1 to 0.5) and contrastive temperatures (0.07 to 0.1), improving mAP by 5% to 10% to reach 44.59% on KITTI. The MRT framework [159] adjusts reconstruction loss weights and retraining epochs, enhancing mAP to 51.2% from Cityscapes to Foggy Cityscapes. DETR-style detectors [160] benefit from HPO on attention dropout rates and learning schedules, achieving an mAP of 46.8% for Cityscapes to Foggy Cityscapes. In ML-ANet [161], MMD alignment benefits from tuning kernel bandwidths, a parameter that controls the sensitivity of MMD to variations in data distributions and allows precise domain alignment, as well as specific hidden-layer learning rates, achieving a mean accuracy of 94.83% for Cityscapes to Foggy Cityscapes.

在基于差异的领域自适应模型中,超参数优化(HPO)用于优化领域对齐指标,如FID和MMD。在ParaTeacher [179]中,HPO通过调整对齐系数(0.1到0.5)和对比温度(0.07到0.1)微调模块,使KITTI数据集上的mAP提升了5%10%,达到44.59%。MRT框架[159]调整重建损失权重和再训练轮数,将Cityscapes到Foggy Cityscapes的mAP提升至51.2%。DETR风格的检测器[160]通过HPO优化注意力丢弃率和学习计划,实现了Cityscapes到Foggy Cityscapes的mAP为46.8%。在ML-ANet [161]中,MMD对齐通过调节核带宽——一个控制MMD对数据分布变化敏感度的参数,实现精确领域对齐——以及特定隐藏层的学习率,达到了Cityscapes到Foggy Cityscapes的平均准确率94.83%。

In adversarial-based DA models, HPO plays a crucial role in refining parameters for adversarial losses, feature alignment, and I2IT techniques. In [148], a combined I2IT framework utilizes a VAE-GAN structure with an MMD-based VAE. Here, MMD serves as an effective discrepancy measure to align latent distributions, while HPO is used to balance the reconstruction loss weight (ranging from 0.2 to 0.5) against the adversarial weights. This approach achieved PSNR and SSIM scores of 27.3772 and 0.9271, respectively, on the Apollo dataset.

在基于对抗的领域自适应模型中,超参数优化(HPO)在细化对抗损失、特征对齐和图像到图像转换(I2IT)技术的参数方面起着关键作用。在[148]中,结合的I2IT框架采用了带有基于MMD的变分自编码器(VAE)的VAE-GAN结构。这里,MMD作为一种有效的差异度量用于对齐潜在分布,同时HPO用于平衡重建损失(范围0.2到0.5)和对抗权重。该方法在Apollo数据集上分别实现了27.3772的峰值信噪比(PSNR)和0.9271的结构相似性指数(SSIM)。

TABLE 4. A comprehensive comparison of various domain adaptive ML models applied to traffic scene understanding, highlighting the differences in applications, categories, variance across models, datasets utilized, performance metrics, and the resulting effectiveness in their respective applications.

表4. 各类应用于交通场景理解的领域自适应机器学习模型的综合比较,重点展示了应用差异、类别、模型间的变异性、所用数据集、性能指标及其在各自应用中的效果。

| Application | Category | Variance | Dataset | Performance Metric | Result |
| Classification | Discrepancy | DAN [161] | Cityscapes → Foggy Cityscapes | Mean Accuracy | 91.85% |
| | Discrepancy | ML-ANet [161] | Cityscapes → Foggy Cityscapes | Mean Accuracy | 94.83% |
| | Discrepancy | MCD [163] | VisDA 2017 → MSCOCO | Mean Accuracy | 71.9% |
| | Discrepancy | SWD [163] | VisDA 2017 → MSCOCO | Mean Accuracy | 76.4% |
| | Discrepancy | D-MMD [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 72.63% |
| | Adversarial | STAR [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 81.25% |
| | Adversarial | DWL [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 82.38% |
| | Adversarial | SADA [175] | HSD (Sunny → Cloudy/Rainy/Snowy) | Mean Accuracy | 93.20% |
| Object Detection | Clustering | CFFA [153] | Cityscapes → Foggy Cityscapes | mAP | 38.6% |
| | Discrepancy | DefDETR [159] | Cityscapes → Foggy Cityscapes | mAP | 28.5% |
| | Adversarial | MTTrans [159] | Cityscapes → Foggy Cityscapes | mAP | 43.4% |
| | Discrepancy | MRT [159] | Cityscapes → Foggy Cityscapes | mAP | 51.2% |
| | Discrepancy | Deformable DETR [160] | Cityscapes → Foggy Cityscapes | mAP | 28.6% |
| | Adversarial | SFA [160] | Cityscapes → Foggy Cityscapes | mAP | 41.3% |
| | Adversarial | O2-net [160] | Cityscapes → Foggy Cityscapes | mAP | 46.8% |
| | Discrepancy | TDOD (Without Adaptation) [164] | DGTA (Clear → Overcast) | AP50 | 90.3% |
| | Discrepancy | TDOD (With Adaptation) [164] | DGTA (Clear → Overcast) | AP50 | 93.1% |
| | Adversarial | DaytoNight-No Augmentation [168] | BDD (Day → Night) | mAP | 55.3% |
| | Adversarial | DaytoNight-With Augmentation [168] | BDD (Day → Night) | mAP | 57.2% |
| | Adversarial | AFAN [170] | Cityscapes → KITTI | mAP | 41.4% |
| | Adversarial | FogAndRainDA [171] | Cityscapes → Rainy Cityscapes | mAP | 45.0% |
| | Adversarial | YOLOv4+CycleGAN [172] | HVFD (Normal → Foggy) | mAP50 | 67.21% |
| | Adversarial | YOLOv4+CPGAN [172] | HVFD (Normal → Foggy) | mAP50 | 69.24% |
| | Adversarial | MGA [173] | SIM 10K → Cityscapes | AP50 | 49.8% |
| | Adversarial | DAAF [173] | SIM 10K → Cityscapes | AP50 | 53.4% |
| | Adversarial | C2FDA [177] | Cityscapes → Foggy Cityscapes | mAP | 40.5% |
| | Adversarial | SAD [178] | Cityscapes → Foggy Cityscapes | mAP | 45.2% |
| | Discrepancy | MTOR [179] | Virtual KITTI → KITTI | mAP | 32.75% |
| | Adversarial | ParaTeacher [179] | Virtual KITTI → KITTI | mAP | 44.59% |
| Segmentation | Clustering | FFREEDA [150] | GTA5 → Mapillary | mIoU | 40.16 ± 1.02 |
| | Discrepancy | SWD [163] | GTA5 → Cityscapes | mIoU | 44.5% |
| | Adversarial | DaytoNight-No Augmentation [168] | BDD (Day → Night) | mIoU | 59.5% |
| | Adversarial | DaytoNight-With Augmentation [168] | BDD (Day → Night) | mIoU | 61.6% |
| | Adversarial | AdaptSegNet [169] | GTA5 → Cityscapes | mIoU | 32.49% |
| | Adversarial | DRN-D-BasedDA [169] | GTA5 → Cityscapes | mIoU | 37.35% |
| | Adversarial | UDAofUrbanScenes [174] | GTA5 → Cityscapes | mIoU | 30.2% |
| | Adversarial | MTKT [176] | ACDC (Sunny → Cloudy/Rainy/Snowy) | IoU | 71.01% |
| | Adversarial | LSA-UDA [176] | ACDC (Sunny → Cloudy/Rainy/Snowy) | IoU | 76.13% |
| | Adversarial | CyCADA feature-only [181] | SYNTHIA → Cityscapes | mIoU | 31.7% |
| | Adversarial | CyCADA pixel-only [181] | SYNTHIA → Cityscapes | mIoU | 37.0% |
| | Adversarial | CyCADA pixel+feature [181] | SYNTHIA → Cityscapes | mIoU | 39.5% |
| | Adversarial | FREDOM [183] | GTA5 → Cityscapes | mIoU | 73.6% |
| I2IT | Adversarial | UNIT [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 24.52, 0.85 |
| | Adversarial | CycleGAN [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 25.19, 0.89 |
| | Adversarial | VAE-GAN [148] | Apollo (Haze → Dehaze) | PSNR, SSIM | 27.38, 0.93 |
| | Adversarial | AugGAN [168] | BDD (Day → Night) | FID | 67.07 |
| | Adversarial | SemGAN [168] | BDD (Day → Night) | FID | 39.91 |
| | Adversarial | DaytoNight [168] | BDD (Day → Night) | FID | 39.26 |
| | Adversarial | CycleGAN [168] | BDD (Day → Night) | FID | 35.28 |
| | Adversarial | MUNIT+Seg [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 98.7 |
| | Adversarial | BicycleGAN+Seg [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 97.9 |
| | Adversarial | SCGAN [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 92.4 |
| | Adversarial | TSIT+Seg [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 80.8 |
| | Adversarial | INIT [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 76.7 |
| | Adversarial | PanopticGAN [180] | Augmented KAIST-MS BDD (Day → Night) | FID | 69.4 |
| | Adversarial | CycleGAN [182] | BDD (Day → Night) | FID | 35.52 |
| | Adversarial | SemGAN [182] | BDD (Day → Night) | FID | 35.26 |
| | Adversarial | AugGAN [182] | BDD (Day → Night) | FID | 57.72 |
| | Adversarial | UNIT [182] | BDD (Day → Night) | FID | 32.66 |
| | Adversarial | MUNIT [182] | BDD (Day → Night) | FID | 69.97 |
| | Adversarial | SGND [182] | BDD (Day → Night) | FID | 31.25 |
| Person Re-ID | Clustering | CDCL [149] | DukeMTMC-ReID → Market1501 | mAP | 81.5% |
| | Clustering | DMD [150] | DukeMTMC-reID → Market1501 | mAP | 92.7% |
| | Discrepancy | D-MMD [162] | DukeMTMC → Market1501 | mAP | 48.8% |
| 应用 | 类别 | 方差 | 数据集 | 性能指标 | 结果 |
| 分类 | 差异 | DAN [161] | Cityscapes → 雾天Cityscapes | 平均准确率 | 91.85% |
| | 差异 | ML-ANet [161] | Cityscapes → 雾天Cityscapes | 平均准确率 | 94.83% |
| | 差异 | MCD [163] | VisDA 2017 → MSCOCO | 平均准确率 | 71.9% |
| | 差异 | SWD [163] | VisDA 2017 → MSCOCO | 平均准确率 | 76.4% |
| | 差异 | D-MMD [175] | HSD(晴天 → 多云/雨天/雪天) | 平均准确率 | 72.63% |
| | 对抗 | STAR [175] | HSD(晴天 → 多云/雨天/雪天) | 平均准确率 | 81.25% |
| | 对抗 | DWL [175] | HSD(晴天 → 多云/雨天/雪天) | 平均准确率 | 82.38% |
| | 对抗 | SADA [175] | HSD(晴天 → 多云/雨天/雪天) | 平均准确率 | 93.20% |
| 目标检测 | 聚类 | CFFA [153] | Cityscapes → 雾天Cityscapes | mAP | 38.6% |
| | 差异 | DefDETR [159] | Cityscapes → 雾天Cityscapes | mAP | 28.5% |
| | 对抗 | MTTrans [159] | Cityscapes → 雾天Cityscapes | mAP | 43.4% |
| | 差异 | MRT [159] | Cityscapes → 雾天Cityscapes | mAP | 51.2% |
| | 差异 | 可变形DETR [160] | Cityscapes → 雾天Cityscapes | mAP | 28.6% |
| | 对抗 | SFA [160] | Cityscapes → 雾天Cityscapes | mAP | 41.3% |
| | 对抗 | O2网络 [160] | Cityscapes → 雾天Cityscapes | mAP | 46.8% |
| | 差异 | TDOD(无适应)[164] | DGTA(晴朗 → 阴天) | AP50 | 90.3% |
| | 差异 | TDOD(有适应)[164] | DGTA(晴朗 → 阴天) | AP50 | 93.1% |
| | 对抗 | DaytoNight-无增强 [168] | BDD(日间 → 夜间) | mAP | 55.3% |
| | 对抗 | DaytoNight-有增强 [168] | BDD(日间 → 夜间) | mAP | 57.2% |
| | 对抗 | AFAN [170] | Cityscapes → KITTI | mAP | 41.4% |
| | 对抗 | FogAndRainDA [171] | Cityscapes → 雨天Cityscapes | mAP | 45.0% |
| | 对抗 | YOLOv4+CycleGAN [172] | HVFD(正常 → 雾天) | mAP50 | 67.21% |
| | 对抗 | YOLOv4+CPGAN [172] | HVFD(正常 → 雾天) | mAP50 | 69.24% |
| | 对抗 | MGA [173] | SIM 10K → Cityscapes | AP50 | 49.8% |
| | 对抗 | DAAF [173] | SIM 10K → Cityscapes | AP50 | 53.4% |
| | 对抗 | C2FDA [177] | Cityscapes → 雾天Cityscapes | mAP | 40.5% |
| | 对抗 | SAD [178] | Cityscapes → 雾天Cityscapes | mAP | 45.2% |
| | 差异 | MTOR [179] | 虚拟KITTI → KITTI | mAP | 32.75% |
| | 对抗 | ParaTeacher [179] | 虚拟KITTI → KITTI | mAP | 44.59% |
| 分割 | 聚类 | FFREEDA [150] | GTA5 → Mapillary | mIoU | 40.16 ± 1.02 |
| | 差异 | SWD [163] | GTA5 → Cityscapes | mIoU | 44.5% |
| | 对抗 | DaytoNight-无增强 [168] | BDD(日间 → 夜间) | mIoU | 59.5% |
| | 对抗 | DaytoNight-有增强 [168] | BDD(日间 → 夜间) | mIoU | 61.6% |
| | 对抗 | AdaptSegNet [169] | GTA5 → Cityscapes | mIoU | 32.49% |
| | 对抗 | 基于DRN的DA [169] | GTA5 → Cityscapes | mIoU | 37.35% |
| | 对抗 | 城市场景的UDA [174] | GTA5 → Cityscapes | mIoU | 30.2% |
| | 对抗 | MTKT [176] | ACDC(晴天 → 多云/雨天/雪天) | 交并比(IoU) | 71.01% |
| | 对抗 | LSA-UDA [176] | ACDC(晴天 → 多云/雨天/雪天) | 交并比(IoU) | 76.13% |
| | 对抗 | 仅特征的CyCADA [181] | SYNTHIA → Cityscapes | mIoU | 31.7% |
| | 对抗 | 仅像素的CyCADA [181] | SYNTHIA → Cityscapes | mIoU | 37.0% |
| | 对抗 | 像素+特征的CyCADA [181] | SYNTHIA → Cityscapes | mIoU | 39.5% |
| | 对抗 | FREDOM [183] | GTA5 → Cityscapes | mIoU | 73.6% |
| I2IT | 对抗 | UNIT [148] | Apollo(雾霾 → 去雾) | 峰值信噪比(PSNR),结构相似性指数(SSIM) | 24.52, 0.85 |
| | 对抗 | CycleGAN [148] | Apollo(雾霾 → 去雾) | 峰值信噪比(PSNR),结构相似性指数(SSIM) | 25.19, 0.89 |
| | 对抗 | 变分自编码器生成对抗网络(VAE-GAN)[148] | Apollo(雾霾 → 去雾) | 峰值信噪比(PSNR),结构相似性指数(SSIM) | 27.38, 0.93 |
| | 对抗 | AugGAN [168] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 67.07 |
| | 对抗 | SemGAN [168] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 39.91 |
| | 对抗 | 昼夜转换(DaytoNight)[168] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 39.26 |
| | 对抗 | CycleGAN [168] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 35.28 |
| | 对抗 | MUNIT+分割(Seg)[180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 98.7 |
| | 对抗 | BicycleGAN+分割(Seg)[180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 97.9 |
| | 对抗 | SCGAN [180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 92.4 |
| | 对抗 | TSIT+分割(Seg)[180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 80.8 |
| | 对抗 | INIT [180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 76.7 |
| | 对抗 | 全景生成对抗网络(PanopticGAN)[180] | 增强版KAIST-MS BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 69.4 |
| | 对抗 | CycleGAN [182] | BDD(白天 → 夜晚) | 弗雷歇特征距离(FID) | 35.52 |
| | 对抗 | SemGAN [182] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 35.26 |
| | 对抗 | AugGAN [182] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 57.72 |
| | 对抗 | UNIT [182] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 32.66 |
| | 对抗 | MUNIT [182] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 69.97 |
| | 对抗 | SGND [182] | BDD(日间 → 夜间) | 弗雷歇特征距离(FID) | 31.25 |
| 行人重识别(Person Re-ID) | 聚类 | CDCL [149] | DukeMTMC-ReID → Market1501 | mAP | 81.5% |
| | 聚类 | DMD [150] | DukeMTMC-reID → Market1501 | mAP | 92.7% |
| | 差异 | D-MMD [162] | DukeMTMC → Market1501 | mAP | 48.8% |

In semantic segmentation, the adversarial framework FREDOM [183] employs transformer-based networks, requiring careful HPO of hyperparameters such as the learning rate (set at \(2.5 \times 10^{-4}\)), momentum (0.9), weight decay (\(10^{-4}\)), and batch size (4 per GPU) to balance fairness objectives with segmentation accuracy. With these calibrated hyperparameters, FREDOM achieves notable improvements, with mIoU scores reaching 67.0% on the SYNTHIA to Cityscapes benchmark and 73.6% on the GTA5 to Cityscapes benchmark, showcasing its ability to achieve class balance across complex traffic environments.

在语义分割中,对抗框架FREDOM [183]采用基于变换器(transformer-based)的网络,需对学习率(设定为2.5×10⁻⁴)、动量(0.9)、权重衰减(10⁻⁴)和批量大小(每GPU 4)等超参数进行精细调优,以平衡公平性目标与分割精度。通过这些校准的超参数,FREDOM取得显著提升,在SYNTHIA到Cityscapes的迁移任务中mIoU达到67.0%,在GTA5到Cityscapes的基准测试中达到73.6%,展示了其在复杂交通环境中实现类别平衡的能力。
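Concretely, the hyperparameters quoted above translate into a standard SGD configuration; the snippet below is only a sketch with a placeholder network standing in for the transformer-based segmentation model, not the FREDOM training code.

```python
import torch

# Placeholder network standing in for the transformer-based segmentation model.
model = torch.nn.Conv2d(3, 19, kernel_size=1)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=2.5e-4,          # learning rate
    momentum=0.9,       # momentum
    weight_decay=1e-4,  # weight decay
)
# The batch size of 4 per GPU would be set in the data loader, e.g.:
# loader = torch.utils.data.DataLoader(train_set, batch_size=4, shuffle=True)
```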

For day-to-night translation tasks, adversarial models based on CycleGAN [137], such as the model in [168], benefit significantly from HPO, particularly through tuning the cycle-consistency loss parameter (λ, set between 5 and 10) and generator learning rates (between 0.0001 and 0.0002). These optimizations improved mAP from 55.3% to 57.2% and mIoU from 59.5% to 61.6% on the BDD dataset. Similarly, CPGAN [172] utilizes HPO for optimizing generator and discriminator learning rates, as well as feature alignment loss weights, achieving an mAP50 of 69.24% on the HVFD dataset.

对于昼夜转换任务,基于CycleGAN [137]的对抗模型,如文献[168]中的模型,通过超参数优化(HPO)显著受益,特别是在调整循环一致性损失参数(λ,设定在5到10之间)和生成器学习率(介于0.0001到0.0002之间)方面。这些优化使BDD数据集上的mAP从55.3%提升至57.2%,mIoU从59.5%提升至61.6%。类似地,CPGAN [172]利用HPO优化生成器和判别器的学习率及特征对齐损失权重,在HVFD数据集上实现了mAP50为69.24%的成绩。
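For reference, the cycle-consistency term tuned above can be written as in the generic CycleGAN-style sketch below, where G maps day to night, F maps night to day, and lam is the λ weight searched between roughly 5 and 10; this is not the exact implementation in [168].

```python
import torch.nn.functional as F

def cycle_consistency_loss(real_day, rec_day, real_night, rec_night, lam=10.0):
    """L1 cycle-consistency term, where rec_day = F(G(real_day)) and
    rec_night = G(F(real_night)); lam trades it off against the GAN losses."""
    return lam * (F.l1_loss(rec_day, real_day) + F.l1_loss(rec_night, real_night))
```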

E. COMPARISON OF DOMAIN ADAPTATION MACHINE LEARNING MODELS

E. 域适应机器学习模型比较

A comparison of different categories of DA models is presented in Table 4. In classification tasks, discrepancy-based methods such as D-MMD [175] achieved a mean accuracy of 72.63% on the HSD dataset for sunny to cloudy/rainy/snowy weather. Successive models like STAR [175] and DWL [175] improved the mean accuracy to 81.25% and 82.38%, respectively. The SADA model [175] further enhanced performance, reaching a mean accuracy of 93.20%, demonstrating the effectiveness of self-adaptive adversarial approaches in handling domain shifts due to weather variations. Discrepancy-based methods like DAN [161] and ML-ANet [161] also showed high performance on the Cityscapes to Foggy Cityscapes benchmark, with mean accuracies of 91.85% and 94.83%, respectively.

表4展示了不同类别域适应(DA)模型的比较。在分类任务中,基于差异的算法如D-MMD [175]在HSD数据集(晴天到多云/雨天/雪天)上实现了平均准确率为72.63%。后续模型如STAR [175]和DWL [175]分别将平均准确率提升至81.25%82.38%。SADA模型[175]进一步提升性能,达到平均准确率93.20%,证明了自适应对抗方法在应对因天气变化引起的域偏移方面的有效性。基于差异的方法如DAN [161]和ML-ANet [161]在Cityscapes到雾天Cityscapes的任务中也表现优异,平均准确率分别为91.85%和94.83%。

For object detection, the I2IT adversarial framework for day-to-night transformation [168] achieved an mAP of 55.3% on the BDD dataset without augmentation, while the inclusion of data augmentation improved the mAP to 57.2%. This improvement highlights the benefit of data augmentation within an adversarial-based DA framework. The adversarial-based teacher-student framework [156] using CPGAN [172] on the HVFD dataset achieved an mAP50 of 69.24%, outperforming the version with CycleGAN, which achieved 67.21%. Other teacher-student models [156], including MTOR [179] and ParaTeacher [179], achieved mAPs of 32.75% and 44.59%, respectively, on the Virtual KITTI to KITTI adaptation. Domain adaptation for DefDETR [159] achieved an mAP of 28.5% when adapting from Cityscapes to Foggy Cityscapes. In comparison, adversarial-based methods such as MTTrans [159] and MRT [159] significantly improved the mAP to 43.4% and 51.2%, respectively, demonstrating the efficacy of transformer-based approaches in minimizing domain discrepancies. The AFA framework with AFAN [170] attained an mAP of 41.4% when adapting from Cityscapes to KITTI, while FogAndRainDA [171] reached an mAP of 45.0% from Cityscapes to Rainy Cityscapes. Transformer-based methods, including Deformable DETR [160], SFA [160], and O2-net [160], achieved mAPs of 28.6%, 41.3%, and 46.8%, respectively, on Cityscapes to Foggy Cityscapes. Based on these results, addressing the domain gap between traffic object detection datasets is a worthwhile direction for improving the generalization capacity of computer vision models.

在目标检测方面,基于I2IT的昼夜转换对抗框架[168]在BDD数据集上未使用增强时实现了mAP为55.3%,而加入数据增强后mAP提升至57.2%。这一提升凸显了数据增强在对抗域适应框架中的益处。基于CPGAN [172]的对抗教师-学生框架[156]在HVFD数据集上实现了mAP50为69.24%,优于使用CycleGAN版本的67.21%。其他教师-学生模型[156],包括MTOR [179]和ParaTeacher [179],在Virtual KITTI到KITTI的适应任务中分别取得了32.75%和44.59%的mAP。DefDETR [159]在Cityscapes到雾天Cityscapes的适应中实现了mAP为28.5%。相比之下,基于对抗的方法如MTTrans [159]和MRT [159]显著提升mAP至43.4%和51.2%,展示了基于变换器方法在减少域差异方面的有效性。AFA框架结合AFAN [170]在Cityscapes到KITTI的适应中达到41.4%的mAP,而FogAndRainDA [171]在Cityscapes到雨天Cityscapes的适应中达到45.0%。基于变换器的方法,包括Deformable DETR [160]、SFA [160]和O2网络[160],在Cityscapes到雾天Cityscapes的任务中分别实现了28.6%,41.3%46.8%的mAP。基于这些结果,解决交通目标检测数据集之间的域差距是提升计算机视觉模型泛化能力的有价值方向。

In segmentation tasks, the adversarial I2IT framework with DaytoNight [168] achieved an mIoU of 59.5% on the BDD dataset without augmentation, and 61.6% with augmentation. CyCADA [181], applied from SYNTHIA to Cityscapes, achieved an mIoU of 39.5%, demonstrating the effectiveness of combining pixel- and feature-level adaptation. UDAofUrbanScenes [174] achieved an mIoU of 30.2% on the GTA5 to Cityscapes benchmark. The adversarial teacher-student framework [156] using AdaptSegNet [169] achieved an mIoU of 32.49% on GTA5 to Cityscapes, while DRN-D-BasedDA [169] improved to 37.35%. The AFA framework using LSA-UDA [176] achieved an IoU of 76.13% on the ACDC dataset (sunny to cloudy/rainy/snowy), highlighting the potential of adversarial feature alignment in handling varying weather conditions. Leveraging the transformer network framework [188], FREDOM [183] achieved an mIoU of 73.6% on GTA5 to Cityscapes, demonstrating the effectiveness of transformer-based architectures in DA for segmentation tasks.

在分割任务中,采用DaytoNight [168]的对抗性图像到图像转换(I2IT)框架在BDD数据集上未增强时实现了59.5%的mIoU,增强后达到61.6%。CYCADA [181]从SYNTHIA到CityScapes的应用实现了39.5%的mIoU,展示了像素级和特征级适应结合的有效性。UDAofUrbanScenes [174]在GTA5到CityScapes基准上实现了30.2%的mIoU。采用AdaptSegNet [169]的对抗性师生框架 [156]在GTA5到CityScapes上实现了32.49%的mIoU,而基于DRN-D的DA [169]提升至37.35%。使用LSA-UDA [176]的AFA框架在ACDC数据集上针对晴天到多云/雨天/雪天的转换实现了76.13%的IoU,凸显了对抗性特征对齐在处理不同天气条件下的潜力。利用变换器网络框架 [188],FREDOM [183]在GTA5到CityScapes上实现了73.6%的mIoU,展示了基于变换器架构在分割任务领域自适应(DA)中的有效性。
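Since most of the segmentation comparisons above are reported in mIoU, the minimal NumPy sketch below shows how that metric is computed from predicted and ground-truth label maps; exact evaluation protocols (ignored labels, class lists) vary by benchmark, so this is a generic illustration only.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-class IoU from integer label maps, averaged over classes that appear."""
    ious = []
    valid = target != ignore_index
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class absent in both prediction and label
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```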

For I2IT, the adversarial VAE-GAN [148] achieved a PSNR of 27.38 and an SSIM of 0.93 on the Apollo traffic scene dataset (Haze to Dehaze), outperforming UNIT and CycleGAN on the same dataset. In one study [168], CycleGAN achieved an FID of 35.28 on the BDD dataset for the day-to-night task, while DaytoNight, SemGAN, and AugGAN reported FID scores of 39.26, 39.91, and 67.07, respectively. In a separate study [182], CycleGAN attained an FID of 35.52, whereas SemGAN slightly improved to 35.26. Additionally, AugGAN and MUNIT produced FID scores of 57.72 and 69.97, respectively. Notably, the SGND model achieved the lowest FID score of 31.25, suggesting superior image quality for the generated scenes compared to the other models.

对于图像到图像转换(I2IT),对抗性VAE-GAN [148]在Apollo交通场景数据集(雾霾到去雾)上实现了27.38的峰值信噪比(PSNR)和0.93的结构相似性指数(SSIM),优于同一数据集上的UNIT和CycleGAN。在一项研究中 [168],CycleGAN在BDD数据集的昼夜转换任务中实现了35.28的FID,而DaytoNight、SemGAN和AugGAN分别报告了39.26、39.91和67.07的FID分数。在另一项研究 [182]中,CycleGAN获得了35.52的FID,SemGAN略微提升至35.26。此外,AugGAN和MUNIT分别产生了57.72和69.97的FID分数。值得注意的是,SGND模型实现了最低的31.25的FID分数,表明其生成场景的图像质量优于其他模型。
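For the dehazing results above, PSNR can be computed as in the sketch below from a restored image and its reference; SSIM is usually taken from an existing implementation (e.g., scikit-image's structural_similarity), and FID additionally requires Inception features, so only the simple PSNR computation is shown here.

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a restored image."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")               # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```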

In Person Re-ID, clustering-based methods showed significant improvements. CDCL [149] achieved an mAP of 81.5% when adapting from DukeMTMC-ReID to Market1501. DMD [150] further improved the mAP to 92.7% on the same dataset pair, indicating the effectiveness of clustering techniques in handling domain shifts for Person Re-ID tasks. Discrepancy-based methods like D-MMD [162] achieved a lower mAP of 48.8%, suggesting that clustering methods may be more suitable for this application.

在行人重识别(Person Re-ID)中,基于聚类的方法表现出显著提升。CDCL [149]在从DukeMTMC-ReID到Market1501的适应中实现了81.5%的mAP。DMD [150]在相同数据集对上进一步将mAP提升至92.7%,表明聚类技术在处理Person Re-ID任务中的域转移问题上效果显著。基于差异的方法如D-MMD [162]实现了较低的48.8% mAP,暗示聚类方法可能更适合该应用。

The results demonstrate that the choice of DA method significantly impacts performance across various applications. Clustering-based methods are particularly strong in Person Re-ID, with DMD [150] achieving a high mAP of 92.7%. Discrepancy-based methods are effective in classification and object detection, exemplified by ML-ANet [161] with 94.83% accuracy in classification. Adversarial-based methods perform well across multiple tasks, including classification, object detection, segmentation, and I2IT, with models like SADA [175] achieving 93.20% accuracy and SGND [182] reaching the best FID of 31.25 in I2IT. Effectively addressing the domain gap is therefore crucial for enhancing the generalization capacity of ML models across these applications.

结果表明,选择不同的域自适应(DA)方法对各类应用的性能有显著影响。基于聚类的方法在Person Re-ID中表现尤为突出,DMD [150]实现了92.7%的高mAP。基于差异的方法在分类和目标检测中效果显著,如ML-ANet [161]在分类中达到94.83%的准确率。基于对抗的方法在分类、目标检测、分割和图像到图像转换(I2IT)等多任务中表现良好,模型如SADA [175]实现了93.20%的准确率,SGND [182]在I2IT中获得了最佳的31.25 FID。有效解决域间差异对于提升模型在各类应用中的泛化能力至关重要。

VI. DISCUSSION

六、讨论

In this section, we examine the key features of deep learning models for traffic scene understanding, highlighting their strengths, limitations, and potential areas for enhancement. The discussion covers discriminative, generative, and DA models. Table 5 summarizes the shortcomings and potential future directions for improvement across these categories, providing an overview to guide further research.

本节我们探讨交通场景理解中深度学习模型的关键特性,重点分析其优势、局限及潜在的改进方向。讨论涵盖判别模型、生成模型和域自适应模型。表5总结了这些类别的不足及未来改进方向,为后续研究提供指导。

A. DISCRIMINATIVE MODELS

A. 判别模型

This subsection discusses the discriminative models, emphasizing their role in traffic scene understanding by examining their strengths and limitations, paving the way for future advancements and potential research directions.

本小节讨论判别模型,强调其在交通场景理解中的作用,分析其优势与局限,为未来的进展和研究方向奠定基础。

1) CNN

1) 卷积神经网络(CNN)

Advantages:

优势:

Disadvantages:

劣势:

Future Work: Future research should focus on reducing reliance on large labeled datasets through semi-supervised and self-supervised learning, improving CNN generalization. Adding context-aware modules like attention mechanisms or non-local operations can help capture global dependencies in traffic scenes, boosting performance in complex environments without the computational cost of transformers.

未来工作:未来研究应着重于通过半监督和自监督学习减少对大规模标注数据集的依赖,提升CNN的泛化能力。引入注意力机制或非局部操作等上下文感知模块,有助于捕捉交通场景中的全局依赖,从而在复杂环境中提升性能,同时避免变换器(transformers)带来的计算开销。

2) VANILLA R-CNN

2) 原始R-CNN

Advantages:

优势:

Disadvantages:

劣势:

Future Work: For future work, we largely expect efforts to focus on the more developed Vanilla R-CNN variants. One possible direction specifically for R-CNN is the development of more efficient algorithms for generating region proposals that can be integrated seamlessly into the R-CNN pipeline. Furthermore, improvements to the ROI pooling procedure could enhance performance, particularly for small-object detection, which would be especially beneficial for certain traffic scene processing tasks such as aerial traffic tracking. Enhancements should also focus on better handling occlusions by incorporating more sophisticated feature extraction techniques that can capture partially hidden objects effectively.

未来工作:未来工作预计将主要聚焦于更成熟的原始R-CNN变体。针对R-CNN的一个可能方向是开发更高效的区域提议算法,能够无缝集成到R-CNN流程中。此外,改进ROI池化过程可提升性能,尤其是小目标检测,这对某些交通场景处理任务(如空中交通跟踪)尤为有益。增强措施还应侧重于通过更复杂的特征提取技术更好地处理遮挡,能够有效捕捉部分隐藏的目标。

3) FAST R-CNN

3) 快速R-CNN

Advantages:

优势:

Disadvantages:

缺点:

Future Work: Although Faster R-CNN has addressed several limitations of Fast R-CNN, Fast R-CNN could still benefit from advancements in representational learning by leveraging modern architectures such as ViTs.

未来工作:尽管Faster R-CNN解决了Fast R-CNN的若干局限,Fast R-CNN仍可通过利用现代架构如ViTs(视觉Transformer)在表征学习方面获益。

4) FASTER R-CNN

4) Faster R-CNN

Advantages:

优势:

Disadvantages:

缺点:

Future Work: Future work could investigate the development of lighter, more resource-efficient Faster R-CNN variants that retain high detection performance while enabling deployment in real-time traffic applications. Additionally, enhancing small-object detection capabilities in Faster R-CNN represents another promising research direction.

未来工作:未来的研究可以探索开发更轻量、更节省资源的Faster R-CNN变体,在保持高检测性能的同时,实现实时交通应用的部署。此外,提升Faster R-CNN在小目标检测方面的能力也是一个有前景的研究方向。

5) MASK R-CNN

5) MASK R-CNN

Advantages:

优势:

Disadvantages:

劣势:

Future Work: Future research should optimize the mask prediction branch to reduce computational overhead while maintaining high accuracy. Enhancing small-object segmentation with multi-scale fusion and advanced attention mechanisms, along with robust algorithms for occlusions using 3D spatial data or multi-view inputs, could further advance Mask R-CNN.

未来工作:未来研究应优化掩码预测分支以降低计算开销,同时保持高精度。通过多尺度融合和先进的注意力机制提升小目标分割能力,结合利用三维空间数据或多视角输入的鲁棒遮挡处理算法,有望进一步推动Mask R-CNN的发展。

6) YOLO

6) YOLO

Advantages:

优势:

Disadvantages:

缺点:

Future Work: Future research should enhance YOLO's detection of small and overlapping objects by refining its grid-based approach, advancing post-processing techniques, and developing more versatile backbones. Efforts should also focus on improving adaptability to challenging environments like adverse weather or crowded scenes and optimizing detection through anchor-free architectures and advanced attention mechanisms.

未来工作:未来研究应通过优化基于网格的方法、提升后处理技术及开发更通用的骨干网络,增强YOLO对小物体和重叠物体的检测能力。同时,应着力提升其对恶劣天气或拥挤场景等复杂环境的适应性,并通过无锚框架和先进的注意力机制优化检测性能。

7) ViT

7) ViT

Advantages:

优点:

Disadvantages:

缺点:

Future Work: Future research should prioritize reducing the computational cost of ViT models and their dependence on large datasets. As advancements are made in these areas, ViTs could be explored as viable alternatives for general computer vision tasks in traffic scenes, where CNNs currently prevail. Furthermore, enhancing ViTs' ability to generalize with limited annotated data through transfer learning and data-efficient training techniques could help mitigate overfitting and improve their applicability to traffic scene tasks with scarce data.

未来工作:未来的研究应优先减少ViT(视觉Transformer)模型的计算成本及其对大规模数据集的依赖。随着这些领域的进展,ViT有望作为交通场景中通用计算机视觉任务的可行替代方案,目前这些任务主要由卷积神经网络(CNN)主导。此外,通过迁移学习和数据高效训练技术提升ViT在有限标注数据下的泛化能力,有助于缓解过拟合问题,增强其在数据稀缺的交通场景任务中的适用性。

8) DETR

8) DETR

Advantages:

优势:

Disadvantages:

劣势:

Future Work: To address DETR's slow convergence, future work could focus on developing more efficient training paradigms that reduce the number of epochs required for convergence. Another promising research direction involves improving the detection of smaller or occluded objects in traffic scenes by modifying the attention mechanism to better capture fine details. Reducing computational overhead to enable real-time deployment is also crucial, as is exploring more flexible approaches for controlling the maximum number of detected objects.

未来工作:为解决DETR训练收敛慢的问题,未来研究可聚焦于开发更高效的训练范式,减少收敛所需的迭代次数。另一个有前景的方向是通过改进注意力机制,更好地捕捉细节,从而提升对交通场景中小目标或遮挡物的检测能力。降低计算开销以实现实时部署同样至关重要,同时探索更灵活的检测目标数量控制方法也是研究重点。

9) GNN

9) GNN

Advantages:

优势:

Disadvantages:

缺点:

Future Work: Future research should focus on efficient graph construction for dynamic traffic scenes, including adaptive real-time updates and robust occlusion handling. Vision GNNs can model spatial relationships and reconstruct occluded features. Reducing GNN training complexity and exploring hybrid approaches with self-supervised learning or multi-modal fusion can further improve robustness and efficiency.

未来工作:未来研究应聚焦于动态交通场景中高效的图构建,包括自适应实时更新和鲁棒的遮挡处理。视觉GNNs能够建模空间关系并重建被遮挡特征。降低GNN训练复杂度,探索与自监督学习或多模态融合的混合方法,可进一步提升鲁棒性和效率。

10) CapsNet

10) 胶囊网络(CapsNet)

Advantages:

优点:

Disadvantages:

缺点:

Future Work: A promising direction for future research is to optimize the routing-by-agreement mechanism to reduce computational complexity while preserving the model's ability to capture spatial hierarchies. Such improvements would enhance the practicality of CapsNets for real-world traffic scene understanding.

未来工作:未来研究的一个有前景的方向是优化基于协议的路由机制,以降低计算复杂度,同时保持模型捕捉空间层次结构的能力。这类改进将提升胶囊网络(CapsNets)在实际交通场景理解中的实用性。

B. GENERATIVE MODELS

B. 生成模型

This subsection explores the generative models, highlighting their significance in traffic scene understanding by analyzing their advantages and challenges, while also suggesting potential future improvements and research opportunities.

本小节探讨生成模型,强调其在交通场景理解中的重要性,通过分析其优势与挑战,同时提出潜在的未来改进和研究机会。

1) GAN

1) 生成对抗网络(GAN)

Advantages:

优势:

Disadvantages:

劣势:

Future Work: In the future, researchers should explore mechanisms to address training instability, such as developing more robust optimization methods or hybrid models that integrate GANs with more stable generative frameworks. Incorporating regularization techniques into the objective function may also help mitigate mode collapse, making GANs more suitable for generating realistic synthetic traffic scene data.

未来工作:未来研究应探索解决训练不稳定的机制,如开发更鲁棒的优化方法或将GAN与更稳定的生成框架结合的混合模型。将正则化技术纳入目标函数也可能有助于缓解模式崩溃,使GAN更适合生成逼真的合成交通场景数据。

2) cGAN

2) 条件生成对抗网络(cGAN)

Advantages:

优势:

Disadvantages:

劣势:

Future Work: Future work could prioritize mitigating the overfitting risk in cGANs by developing more effective regularization techniques and enhancing the diversity of conditional data. Additionally, researchers could explore methods to reduce computational overhead by designing lightweight architectures better suited for real-time traffic scene generation.

未来工作:未来的研究可优先考虑通过开发更有效的正则化技术和增强条件数据的多样性来缓解cGANs的过拟合风险。此外,研究人员还可以探索设计更轻量级架构以减少计算开销,更适合实时交通场景生成的方法。

3) VAE

3) 变分自编码器(VAE)

Advantages:

优势:

Disadvantages:

劣势:

Future Work: In the future, researchers are expected to develop improved adversarial training paradigms to enhance the reconstruction quality of VAE models. Additionally, efforts should focus on designing optimization mechanisms that automatically address the trade-off between accurate reconstruction and diverse data generation.

未来工作:未来研究预计将开发改进的对抗训练范式以提升VAE模型的重构质量。同时,应致力于设计自动调节准确重构与多样数据生成权衡的优化机制。

C. DOMAIN ADAPTATION MODELS

C. 域适应模型

This subsection examines the DA models, focusing on their contributions to traffic scene understanding by evaluating their benefits and limitations, and outlining avenues for future research and development.

本小节探讨域适应(DA)模型,重点评估其在交通场景理解中的贡献,分析其优缺点,并概述未来研究与发展的方向。

1) CLUSTERING-BASED DOMAIN ADAPTATION

1) 基于聚类的域适应

Advantages:

优势:

Disadvantages:

劣势:

Future Work: In the future, we expect research to focus on improving the robustness of cluster formation in noisy or overlapping domains, enhancing scalability for large and complex datasets, and exploring novel approaches to integrating clustering with other DA methods to improve performance across diverse tasks.

未来工作:未来研究预计将聚焦于提升噪声或重叠领域中聚类形成的鲁棒性,增强对大规模复杂数据集的可扩展性,并探索将聚类与其他领域自适应方法结合的新途径,以提升多样任务的性能。

2) DISCREPANCY-BASED DOMAIN ADAPTATION

Advantages:

2) 基于差异的领域自适应

优势:

Disadvantages:

劣势:

TABLE 5. Summary of shortcomings and future directions for improvement in Discriminative, Generative, and DA models.

表5. 判别式、生成式及领域自适应模型的缺点总结及改进方向。

| Category | Framework | Limitations | Future Works |
| Discriminative | CNN | Data dependency; Generalization issues; Limited global contextual understanding | Explore semi-supervised and self-supervised learning methods; Integrate attention mechanisms for global dependencies |
| | Vanilla R-CNN | Inefficient region proposal strategy; High Memory Consumption; Lack of End-to-End Training; Challenges with Occlusions | Develop efficient region proposal algorithms; Improve the ROI pooling procedure; Develop techniques for improved handling of occluded objects |
| | Fast R-CNN | Inefficient Region Proposal Strategy; Lower Small-Object Detection Accuracy | Leverage Faster R-CNN to improve region proposal efficiency; Enhance representational learning with advanced networks like ViT |
| | Faster R-CNN | Limited real-time performance; Being resource-intensive; Lower small-object detection accuracy | Develop lighter, more resource-efficient variants; Improve small-object detection |
| | Mask R-CNN | High Computational Demand; Small Object Segmentation Issues; Training Complexity | Optimize mask prediction for efficiency; Improve small-object segmentation with multi-scale fusion; Handle occlusions using 3D or multi-view data |
| | YOLO | Difficulty with small objects; Challenges with overlapping objects; Accuracy trade-offs; Sensitivity to Occlusions | Refine grid-based approach and develop post-processing techniques; Improve adaptability to challenging environments; Enhance occlusion handling with robust feature extraction |
| | ViT | Heavy data and computation; Overfitting risk; Training complexity | Reduce computational costs; Improve transfer learning and data-efficient training methods |
| | DETR | Slow training convergence; Difficulty with small and occluded objects; Computational overhead | Develop more efficient training paradigms; Improve small-object detection; Reduce computational overhead for real-time deployment |
| | GNN | High computational demand; Data preprocessing complexity; Training complexity | Develop graph construction process for images; Reduce training complexity; Leverage ViGs for handling occlusions and reconstructing object features; Explore hybrid approaches with self-supervised learning and multi-modal fusion |
| | CapsNet | Computational complexity; Scalability issues | Optimize routing-by-agreement mechanism; Enhance scalability for real-world applications |
| Generative | GAN | Training instability; Mode collapse risk | Develop robust optimization methods; Explore hybrid models and regularization techniques |
| | cGAN | Training complexity; Risk of conditional overfitting; Higher resource requirements | Improve regularization techniques; Reduce computational overhead with lightweight architectures |
| | VAE | Blurry reconstructions; Reconstruction vs. generation trade-off | Enhance reconstruction quality through adversarial training; Balance accurate reconstruction and diverse generation |
| DA | Clustering | Sensitivity to Cluster Quality; Difficulty in Handling Domain Overlap; Scalability Challenges; Reliance on Pseudo-Labels and Prototypes | Improve Robustness of Cluster Formation in Noisy Domains; Enhance Scalability for Large Complex Datasets; Integrate Clustering with Other DA Methods |
| | Discrepancy | Dependence on Metric Choice; Limited Adaptation to Complex Shifts; Sensitivity to Feature Representation | Develop Flexible Discrepancy Metrics for Complex Domain Shifts; Improve Feature Representation Techniques; Combine Discrepancy-Based Methods with Other Adaptation Strategies |
| | Adversarial | Training Instability; Mode Collapse Risk; Sensitive to Hyperparameters | Improve Stability of Adversarial Training; Address Mode Collapse Issues; Develop Robust Hyperparameter Tuning Approaches |
| 类别 | 框架 | 局限性 | 未来工作 |
| 判别式 | CNN(卷积神经网络) | 数据依赖性；泛化能力问题；有限的全局上下文理解 | 探索半监督和自监督学习方法；融入注意力机制以捕捉全局依赖 |
| | 基础R-CNN | 区域提议策略效率低；高内存消耗；缺乏端到端训练；遮挡问题挑战 | 开发高效的区域提议算法；改进ROI池化过程；研发更好处理遮挡物体的技术 |
| | Fast R-CNN | 区域提议策略效率低；小目标检测准确率较低 | 利用Faster R-CNN提升区域提议效率；采用ViT(视觉Transformer)等先进网络增强表征学习 |
| | Faster R-CNN | 实时性能有限；资源消耗大；小目标检测准确率较低 | 开发更轻量、资源高效的变体；改进小目标检测 |
| | Mask R-CNN | 计算需求高；小目标分割问题；训练复杂 | 优化掩码预测以提升效率；通过多尺度融合改善小目标分割；利用三维或多视角数据处理遮挡 |
| | YOLO(你只看一次) | 小目标检测困难；重叠物体挑战；准确率权衡；对遮挡敏感 | 优化基于网格的方法并开发后处理技术；提升对复杂环境的适应性；通过鲁棒特征提取增强遮挡处理 |
| | ViT(视觉Transformer) | 数据和计算量大；过拟合风险；训练复杂 | 降低计算成本；改进迁移学习和数据高效训练方法 |
| | DETR(端到端目标检测器) | 训练收敛慢；小目标和遮挡物体检测困难；计算开销大 | 开发更高效的训练范式；改进小目标检测；降低实时部署的计算开销 |
| | GNN(图神经网络) | 计算需求高；数据预处理复杂；训练复杂 | 开发图像图构建流程；降低训练复杂度；利用ViGs处理遮挡和重建物体特征；探索自监督学习与多模态融合的混合方法 |
| | CapsNet(胶囊网络) | 计算复杂度高；可扩展性问题 | 优化路由协议机制；提升实际应用的可扩展性 |
| 生成式 | GAN(生成对抗网络) | 训练不稳定；模式崩溃风险 | 开发稳健的优化方法；探索混合模型和正则化技术 |
| | cGAN(条件生成对抗网络) | 训练复杂；条件过拟合风险；资源需求较高 | 改进正则化技术；通过轻量架构降低计算开销 |
| | VAE(变分自编码器) | 重建模糊；重建与生成的权衡 | 通过对抗训练提升重建质量；平衡准确重建与多样化生成 |
| DA(领域自适应) | 聚类 | 对聚类质量敏感；处理领域重叠困难；可扩展性挑战；依赖伪标签和原型 | 提升噪声领域中聚类形成的鲁棒性；增强大规模复杂数据集的可扩展性；将聚类与其他领域自适应方法结合 |
| | 差异度 | 依赖度量选择；对复杂域偏移适应有限；对特征表示敏感 | 开发适应复杂域偏移的灵活差异度量；改进特征表示技术；将基于差异度的方法与其他自适应策略结合 |
| | 对抗式 | 训练不稳定；模式崩溃风险；对超参数敏感 | 提升对抗训练的稳定性；解决模式崩溃问题；开发稳健的超参数调优方法 |

Future Work: In the future, researchers should explore developing more flexible discrepancy metrics that can handle complex domain shifts, improving feature representation techniques, and exploring hybrid approaches that combine discrepancy-based methods with other adaptation strategies to enhance robustness and generalization across diverse traffic scene understanding tasks.

未来工作:未来,研究人员应探索开发更灵活的差异度量方法,以应对复杂的领域迁移,改进特征表示技术,并探索将基于差异的方法与其他适应策略相结合的混合方法,以增强在多样化交通场景理解任务中的鲁棒性和泛化能力。

3) ADVERSARIAL-BASED DOMAIN ADAPTATION

Advantages:

3) 基于对抗的领域自适应

优势:

Disadvantages:

缺点:

Future Work: Future research should aim to improve the stability of adversarial training, address mode collapse issues, and explore more robust approaches to hyperparameter tuning to enhance the scalability and reliability of adversarial-based DA methods across a wider range of tasks.

未来工作:未来研究应致力于提升对抗训练的稳定性,解决模式崩溃问题,并探索更鲁棒的超参数调优方法,以增强基于对抗的领域自适应方法在更广泛任务中的可扩展性和可靠性。

VII. FUTURE RESEARCH AREAS

七、未来研究方向

While advancements in DL methods have significantly improved traffic scene understanding, further progress is possible. This section highlights key research topics for future work, emphasizing the need for more reliable, versatile, efficient, and scalable DL frameworks. The performance of these systems, particularly in complex real-world scenarios, hinges on the quality of underlying DL models. Future work should focus on developing models that address practical challenges like real-time performance and generalizability, along with holistic challenges such as integrating multi-modal data and enhancing model interpretability.

尽管深度学习方法的进步显著提升了交通场景理解,但仍有进一步发展的空间。本节重点介绍未来工作的关键研究主题,强调需要更可靠、多功能、高效且可扩展的深度学习框架。这些系统的性能,尤其是在复杂的现实场景中,依赖于底层深度学习模型的质量。未来工作应聚焦于开发能够解决实时性能和泛化能力等实际挑战的模型,以及整合多模态数据和提升模型可解释性等整体性挑战。

Exploring methodologies from diverse domains could boost the robustness and versatility of traffic scene understanding models. For example, disaster management techniques [189] may inspire innovative approaches to traffic analysis. Adapting successful strategies from other fields could improve scalability and reliability. Additionally, integrating real-time algorithms, like the simultaneous vehicle detection and tracking method for aerial videos [190], could enhance the speed and scalability of DL models for urban traffic scene understanding.

借鉴不同领域的方法论可能提升交通场景理解模型的鲁棒性和多样性。例如,灾害管理技术[189]可能为交通分析带来创新思路。借鉴其他领域的成功策略可提升模型的可扩展性和可靠性。此外,集成实时算法,如用于航拍视频的车辆检测与跟踪方法[190],可增强深度学习模型在城市交通场景理解中的速度和可扩展性。

A. XAI

A. 可解释人工智能(XAI)

Most computer vision-based deep learning frameworks for traffic scene understanding operate as black-box models, lacking simple or straightforward methods for assessing and interpreting their outputs. This raises concerns regarding the reliability and transparency of such systems, as well as the feasibility of deploying real-world applications that leverage traffic scene understanding for decision-making tasks, such as road safety and risk assessment [191]. The inability to justify decisions based on the vision component's output further complicates their deployment. Especially for downstream applications like autonomous driving, it is preferable to have some mechanism for evaluating how the deep model actually understands a traffic scene, which allows for a domain expert to identify gaps in a model's capabilities. XAI addresses this issue by providing techniques and methodologies-such as post-hoc explanation methods or inherently interpretable architectures-that clarify how inputs influence the model's output. Indeed, for real-world systems such as automated urban intervention systems, which aim to improve pedestrian and vehicle safety by leveraging DNNs for detection, tracking, and behavior prediction, researchers have recently proposed adopting XAI techniques to provide insights into traffic control, surveillance, and collision prevention for autonomous vehicles [192]. Recent research [191] has introduced the interpretability of NNs in traffic sign recognition systems to enhance road safety and optimize traffic management by leveraging XAI techniques like Local Interpretable Model-Agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (Grad-CAM). LIME provides explainability by approximating the behavior of a model locally given some predictions, while Grad-CAM generates heat maps that show which regions of an image contribute most to a prediction based on the activated gradients within the deep layers. Moreover, the significance of scene understanding for autonomous vehicles in unstructured traffic environments is emphasized in [193], suggesting the use of models like the Inception U-Net with Grad-CAM visualization to enhance navigation in crowded traffic scenarios.

大多数基于计算机视觉的交通场景理解深度学习框架作为黑箱模型运行,缺乏简单直接的评估和解释其输出的方法。这引发了对系统可靠性和透明性的担忧,以及在实际应用中利用交通场景理解进行决策(如道路安全和风险评估[191])的可行性问题。基于视觉组件输出无法解释决策,进一步增加了部署难度。尤其对于自动驾驶等下游应用,最好具备某种机制来评估深度模型对交通场景的实际理解,以便领域专家识别模型能力的不足。可解释人工智能(XAI)通过提供技术和方法——如事后解释方法或内在可解释架构——阐明输入如何影响模型输出,解决了这一问题。实际上,对于旨在通过深度神经网络(DNN)实现检测、跟踪和行为预测以提升行人和车辆安全的自动化城市干预系统,研究者近期提出采用XAI技术,为交通控制、监控和自动驾驶车辆的碰撞预防提供洞见[192]。近期研究[191]引入了神经网络在交通标志识别系统中的可解释性,利用局部可解释模型无关解释(LIME)和梯度加权类激活映射(Grad-CAM)等XAI技术,提升道路安全和优化交通管理。LIME通过局部近似模型行为提供解释,而Grad-CAM基于深层激活梯度生成热力图,显示图像中对预测贡献最大的区域。此外,[193]强调了在非结构化交通环境中自动驾驶车辆场景理解的重要性,建议使用如Inception U-Net结合Grad-CAM可视化的模型,以增强拥挤交通场景下的导航能力。
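As a rough illustration of how Grad-CAM produces such heat maps, the sketch below hooks a chosen convolutional layer, weights its activations by the spatially averaged gradients of the target class score, and upsamples the result; it assumes a generic PyTorch classifier (e.g., a ResNet-style backbone, called as grad_cam(model, model.layer4, img, class_idx)) and is not tied to any specific traffic-sign model from the cited works.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM: activations of `target_layer` weighted by the
    spatially averaged gradients of the chosen class score, ReLU'd and
    upsampled to the input resolution."""
    store = {}

    def save_act(module, inp, out):
        store["act"] = out                     # feature maps from the forward pass

    def save_grad(module, grad_in, grad_out):
        store["grad"] = grad_out[0]            # gradients w.r.t. those feature maps

    h1 = target_layer.register_forward_hook(save_act)
    h2 = target_layer.register_full_backward_hook(save_grad)
    try:
        model.zero_grad()
        score = model(image)[0, class_idx]     # scalar score for the target class
        score.backward()
        weights = store["grad"].mean(dim=(2, 3), keepdim=True)        # GAP of gradients
        cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze()                   # normalized heat map
    finally:
        h1.remove()
        h2.remove()
```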

While XAI has been applied to specific traffic computer vision tasks, significant limitations remain in performance and integration. Improvements are needed for methods like LIME and Grad-CAM, particularly as research shifts from CNN-based learning to ViT and GNN. Most XAI focuses on single-model outputs, overlooking complex systems like multi-target multi-camera tracking. Further exploration is required to integrate XAI into multi-modal systems, as demonstrated by a recent autonomous driving XAI system using multi-modal image captioning for decision-making justification [194]. This opens new possibilities for developing XAI systems that merge text and image data to interpret traffic systems' decision-making. Additionally, integrating XAI for real-time explainability could enhance insights in applications like traffic anomaly detection and object detection, improving robustness in challenging conditions, such as adverse weather segmentation [195].

虽然可解释人工智能(XAI)已应用于特定的交通计算机视觉任务,但在性能和集成方面仍存在显著限制。需要改进诸如LIME和Grad-CAM等方法,尤其是在研究从基于卷积神经网络(CNN)的学习转向视觉Transformer(ViT)和图神经网络(GNN)时。大多数XAI关注单一模型输出,忽视了多目标多摄像头跟踪等复杂系统。需要进一步探索将XAI集成到多模态系统中,正如最近一个利用多模态图像字幕生成进行决策解释的自动驾驶XAI系统所示[194]。这为开发融合文本和图像数据以解释交通系统决策的新型XAI系统开辟了新可能。此外,将XAI集成用于实时可解释性,能够增强交通异常检测和目标检测等应用中的洞察力,提高在恶劣天气分割等挑战条件下的鲁棒性[195]。

B. ENHANCING FEATURE REPRESENTATION WITH NOVEL MODELS

B. 利用新型模型增强特征表示

As discussed in this work, transformers and GNNs have gained increasing attention in recent years, with studies showing that ViT and deep GNNs can rival leading CNN architectures while often reducing computational demands [88], [195]. While CNNs have dominated feature extraction in computer vision for over a decade, emerging architectures offer promising opportunities for further advancement. ViT models, for example, require higher-quality data than CNNs due to their lack of inductive bias [196], [197], which historically made CNNs more robust to challenges like occlusion by enforcing local spatial coherence. Influenced by non-local neural networks, ViTs leverage global attention to better handle complex occlusions through long-range dependency modeling. Similarly, GNNs, traditionally limited by over-smoothing in shallow architectures [198], have seen breakthroughs enabling deeper models [199], [200]. Competitive vision GNNs (ViGs), such as the recent model by [195], now match CNN and ViT performance in tasks like classification and detection. GNNs excel at representing graph-structured data, making them effective for reconstructing occluded objects and reasoning about partially visible entities in traffic scenes. Self-supervised learning (SSL) also holds strong potential for these architectures, as methods like contrastive learning [155] enhance performance and robustness, helping mitigate occlusion by fostering holistic representations from incomplete or obstructed data.

如本文所述,近年来Transformer和图神经网络(GNN)受到越来越多关注,研究表明视觉Transformer(ViT)和深度GNN在性能上可与领先的CNN架构媲美,同时通常降低计算需求[88],[195]。尽管CNN在计算机视觉特征提取领域已主导十余年,新兴架构为进一步发展提供了有希望的机会。例如,ViT模型由于缺乏归纳偏置[196],[197],相比CNN需要更高质量的数据,CNN通过强制局部空间一致性在历史上对遮挡等挑战更具鲁棒性。受非局部神经网络影响,ViT利用全局注意力通过长距离依赖建模更好地处理复杂遮挡。同样,传统上受限于浅层架构过度平滑问题的GNN[198],已取得突破,支持更深层模型[199],[200]。竞争性视觉GNN(ViGs),如[195]最新模型,在分类和检测任务中已能匹敌CNN和ViT性能。GNN擅长表示图结构数据,因而在重建遮挡物体和推理交通场景中部分可见实体方面表现出色。自监督学习(SSL)对这些架构也具有强大潜力,诸如对比学习[155]等方法提升性能和鲁棒性,有助于通过从不完整或遮挡数据中构建整体表示来缓解遮挡问题。
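To illustrate the graph construction such vision GNNs rely on, the sketch below builds a k-nearest-neighbour edge index over patch embeddings; the message-passing layers that follow (omitted here) would aggregate neighbours, which is what lets partially occluded objects borrow evidence from related patches. Names and dimensions are illustrative only.

```python
import torch

def knn_patch_graph(patch_feats, k=9):
    """Edge index of shape (2, N*k) connecting each patch embedding in (N, D)
    to its k nearest neighbours in feature space."""
    dist = torch.cdist(patch_feats, patch_feats)                   # (N, N) pairwise distances
    neighbours = dist.topk(k + 1, largest=False).indices[:, 1:]    # drop the self match
    src = torch.arange(patch_feats.size(0)).repeat_interleave(k)
    dst = neighbours.reshape(-1)
    return torch.stack([src, dst])

edges = knn_patch_graph(torch.randn(196, 192), k=9)   # e.g. a 14x14 grid of patches
```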

C. REAL-TIME PROCESSING FOR COMPLEX TRAFFIC SCENE UNDERSTANDING

C. 复杂交通场景理解的实时处理

Critically, while many of the models discussed demonstrate strong performance in offline traffic vision tasks, such as object detection, they face significant challenges in real-time processing applications due to inefficiencies. For example, although YOLOv8, one of the most recent members of the YOLO family and the latest covered in this review, achieves high performance in traffic object detection, its variants still struggle with small-object detection, multi-scale object detection, and detection under adverse environmental conditions [201]. Recent studies have shown that transformer-based architectures can achieve significantly lower latency; however, these models still face difficulties with small-object detection and other challenging conditions [159]. In other domains, researchers have explored the combination of generative models, such as GANs [124], with ViTs [88] to address complex scenarios [202], though further research is necessary to mitigate the high computational costs associated with GANs. For complex traffic scene applications, particularly those involving multiple cameras and downstream decision agents, the underlying deep learning model must be both lightweight and capable of delivering high performance.

关键是,尽管许多讨论的模型在离线交通视觉任务(如目标检测)中表现优异,但由于效率问题,在实时处理应用中面临重大挑战。例如,YOLO系列最新版本YOLOv8在交通目标检测中表现出色,但其变体仍在小目标检测、多尺度目标检测及恶劣环境条件下检测方面存在困难[201]。近期研究表明,基于Transformer的架构可显著降低延迟;然而,这些模型在小目标检测及其他复杂条件下仍面临挑战[159]。在其他领域,研究者探索将生成模型如生成对抗网络(GAN)[124]与ViT结合以应对复杂场景[202],但仍需进一步研究以降低GAN的高计算成本。对于涉及多摄像头和下游决策代理的复杂交通场景应用,底层深度学习模型必须既轻量又能提供高性能。

D. ADDRESSING DATA LIMITATIONS WITH HIGH-QUALITY SYNTHETIC DATA

D. 利用高质量合成数据解决数据限制

Currently, many widely used datasets for traffic scene understanding tasks consist of synthetic data, such as SYNTHIA [203] and GTA5 [204], which feature automatically annotated images from traffic scenes created in the Unity Game Engine and the GTA 5 game environment. Generative AI and DA are commonly employed to address limitations in training data by generating augmented samples for rare or hard-to-capture scenarios, including occluded objects or accidents under adverse weather conditions [205]. Despite advancements, data from virtual simulations remains limited unless traffic objects behave realistically and scenes feature high-fidelity graphics comparable to real-world data. While some work has pursued generating photo-realistic traffic scenes for computer vision tasks [206], recent improvements in virtual engines enable much higher-quality synthetic data generation with automatic labeling at scale [207]. Enhanced rendering capabilities allow the simulation of diverse traffic scenarios, including challenging conditions like rain, snow, or occlusion-heavy nighttime settings. For tasks with limited data or imbalances, synthetic data can help improve model performance on occluded objects and other real-world challenges, reducing reliance on manual annotation. Finally, integrating simulated data with generative augmentation techniques [208] presents a promising approach to mitigate data scarcity while addressing occlusion-related challenges in traffic scene understanding.

目前,许多广泛使用的交通场景理解任务数据集由合成数据构成,如SYNTHIA[203]和GTA5[204],这些数据集包含在Unity游戏引擎和GTA 5游戏环境中自动标注的交通场景图像。生成式人工智能和数据增强(DA)常用于通过生成稀有或难以捕捉场景的增强样本来解决训练数据的限制,包括遮挡物体或恶劣天气下的事故[205]。尽管取得进展,虚拟仿真数据仍受限,除非交通物体行为逼真且场景具备与真实世界数据相当的高保真图形。一些工作致力于生成用于计算机视觉任务的照片级真实交通场景[206],而近期虚拟引擎的改进使得大规模自动标注的高质量合成数据生成成为可能[207]。增强的渲染能力支持模拟多样化交通场景,包括雨雪或遮挡严重的夜间等挑战条件。对于数据有限或不平衡的任务,合成数据有助于提升模型在遮挡物体及其他现实挑战上的表现,减少对人工标注的依赖。最后,将模拟数据与生成式增强技术结合[208],为缓解数据稀缺及解决交通场景理解中的遮挡问题提供了有前景的途径。

E. IMPROVING PERCEPTION USING MULTI-MODALITY AND DATA FUSION

E. 利用多模态和数据融合提升感知能力

The robustness and comprehensiveness of object detection and segmentation in traffic scenes could be significantly enhanced by leveraging the fusion of data from multimodal sensory inputs, such as panoramic images, LiDAR (Light Detection and Ranging) point clouds, thermal imaging, infrared, and video footage. Additionally, incorporating the sophisticated reasoning capabilities of large language models (LLMs) and multimodal LLMs (MLLMs) [209], [210], [211] could facilitate the integration of real-time text-based and linguistic communication with image and video data [212]. Furthermore, although [213] has made progress in applying language-based knowledge guidance, most research focuses on data fusion in only two domains [214], [215], [216]. A comprehensive benchmark is essential for effectively comparing these works and advancing the development of more optimized and holistic multimodal approaches. Effective multi-sensor data fusion is critical. Designing, assessing, and optimizing the performance of fusion operations for deep generative models are key questions. Interoperability of different multi-modality methods [217] with existing infrastructure and their adaptability to evolving traffic conditions will be crucial for their successful implementation. Recent research, such as Feng et al. [217], emphasizes that multi-modal sensor fusion (e.g., LiDAR, cameras, radar) enhances robustness in object detection by addressing challenges like occlusion and adverse conditions. By effectively integrating complementary information from diverse sensor inputs, occluded objects can be more reliably detected and classified, thereby overcoming one of the significant limitations of single-modality perception approaches.

通过融合多模态传感器输入的数据,如全景图像、激光雷达(LiDAR,Light Detection and Ranging)点云、热成像、红外和视频资料,可以显著增强交通场景中目标检测和分割的鲁棒性与全面性。此外,结合大型语言模型(LLMs)和多模态大型语言模型(MLLMs)[209],[210],[211]的复杂推理能力,有助于实现基于文本的实时语言交流与图像及视频数据的融合[212]。尽管[213]在应用基于语言的知识引导方面取得了一定进展,但大多数研究仅聚焦于两个领域的数据融合[214],[215],[216]。建立一个全面的基准对于有效比较这些工作并推动更优化、更整体的多模态方法的发展至关重要。有效的多传感器数据融合是关键。设计、评估和优化深度生成模型的融合操作性能是核心问题。不同多模态方法[217]与现有基础设施的互操作性及其对不断变化的交通状况的适应性,将是其成功实施的关键。近期研究如Feng等人[217]强调,多模态传感器融合(如激光雷达、摄像头、雷达)通过解决遮挡和恶劣环境等挑战,提高了目标检测的鲁棒性。通过有效整合来自多样传感器输入的互补信息,可以更可靠地检测和分类被遮挡的目标,从而克服单一模态感知方法的重大局限性。
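As a toy example of the kind of fusion operation whose design and weighting the paragraph above identifies as an open question, the sketch below concatenates per-object camera and LiDAR feature vectors before a shared classifier; all names and dimensions are illustrative rather than drawn from any cited system.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Toy late-fusion head: concatenate per-object camera and LiDAR features
    before a shared classifier."""
    def __init__(self, cam_dim=256, lidar_dim=128, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(cam_dim + lidar_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, cam_feat, lidar_feat):
        # cam_feat: (B, cam_dim), lidar_feat: (B, lidar_dim)
        return self.classifier(torch.cat([cam_feat, lidar_feat], dim=-1))

head = LateFusionHead()
logits = head(torch.randn(4, 256), torch.randn(4, 128))
```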

VIII. CONCLUSION

八、结论

In conclusion, this review has provided an extensive exploration of deep learning models and their application to traffic scene understanding, a crucial component in advancing intelligent transportation systems. By categorizing and analyzing discriminative, generative, and domain adaptation models, we have offered a comprehensive perspective on the evolution of traffic scene analysis techniques, highlighting the significant advancements and ongoing challenges in the field. Our discussion on hyperparameter optimization has further emphasized the importance of fine-tuning these models for enhanced efficiency and real-time applicability.

综上所述,本综述对深度学习模型及其在交通场景理解中的应用进行了广泛探讨,交通场景理解是推动智能交通系统发展的关键组成部分。通过对判别模型、生成模型和领域自适应模型的分类与分析,我们提供了交通场景分析技术演进的全面视角,突出展示了该领域的重要进展与持续挑战。我们对超参数优化的讨论进一步强调了微调模型以提升效率和实时应用性的必要性。

This paper has addressed the gaps present in existing literature, such as the lack of focus on generative models, limited coverage of domain adaptation techniques, and insufficient analysis of hyperparameter optimization methods. By presenting a structured comparison of discriminative, generative, and DA models, we provided a nuanced understanding of each category's strengths and weaknesses, which can guide researchers in selecting appropriate models for their specific needs in traffic scene analysis. Furthermore, our review identified emerging areas such as XAI, multi-modal data integration, and real-time processing as pivotal research directions for future work.

本文针对现有文献中的不足进行了补充,如对生成模型关注不足、领域自适应技术覆盖有限以及超参数优化方法分析不充分。通过对判别模型、生成模型和领域自适应模型的结构化比较,我们提供了对各类别优缺点的细致理解,指导研究者根据具体需求选择合适的交通场景分析模型。此外,我们的综述指出了可解释人工智能(XAI)、多模态数据融合和实时处理等新兴领域,作为未来研究的关键方向。

Moving forward, it is evident that there is a growing need to enhance the robustness, interpretability, and efficiency of deep learning systems in traffic environments. We encourage future research efforts to focus on improving model performance under diverse environmental conditions, integrating multiple data sources for richer scene understanding, and advancing explainability to foster trust in AI-driven transportation systems. By addressing these challenges, we believe that deep learning will continue to play a pivotal role in shaping the future of intelligent, safe, and efficient transportation solutions.

REFERENCES

展望未来,显然需要提升深度学习系统在交通环境中的鲁棒性、可解释性和效率。我们鼓励未来研究聚焦于提升模型在多样环境条件下的表现,整合多源数据以实现更丰富的场景理解,并推进可解释性以增强对人工智能驱动交通系统的信任。通过应对这些挑战,我们相信深度学习将在塑造智能、安全、高效交通解决方案的未来中继续发挥关键作用。

参考文献

[1] A. Boukerche and Z. Hou, "Object detection using deep learning methods in traffic scenarios," ACM Comput. Surv., vol. 54, no. 2, pp. 1-35, Mar. 2021.

[1] A. Boukerche 和 Z. Hou,“基于深度学习方法的交通场景目标检测”,ACM计算机调查,卷54,第2期,页1-35,2021年3月。

[2] Y. Huang and Y. Chen, "Autonomous driving with deep learning: A survey of state-of-art technologies," 2020, arXiv:2006.06091.

[2] Y. Huang 和 Y. Chen,“基于深度学习的自动驾驶:最先进技术综述”,2020年,arXiv:2006.06091。

[3] Z. Guo, Y. Huang, X. Hu, H. Wei, and B. Zhao, "A survey on deep learning based approaches for scene understanding in autonomous driving," Electronics, vol. 10, no. 4, p. 471, Feb. 2021.

[3] Z. Guo, Y. Huang, X. Hu, H. Wei 和 B. Zhao,“基于深度学习的自动驾驶场景理解方法综述”,电子学,卷10,第4期,页471,2021年2月。

[4] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," J. Field Robot., vol. 37, no. 3, pp. 362-386, Apr. 2020.

[4] S. Grigorescu, B. Trasnea, T. Cocias 和 G. Macesanu,“自动驾驶深度学习技术综述”,现场机器人杂志,卷37,第3期,页362-386,2020年4月。

[5] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.

[5] Y. Lecun, L. Bottou, Y. Bengio 和 P. Haffner,“基于梯度的文档识别学习”,IEEE会议录,卷86,第11期,页2278-2324,1998年。

[6] R. C. Luo, H. Potlapalli, and D. W. Hislop, "Translation and scale invariant landmark recognition using receptive field neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., vol. 1, Jun. 1992, pp. 527-533, doi: 10.1109/IROS.1992.587385.

[6] R. C. Luo, H. Potlapalli 和 D. W. Hislop,“基于感受野神经网络的平移和尺度不变地标识别”,IEEE/RSJ国际智能机器人系统会议录,卷1,1992年6月,页527-533,doi: 10.1109/IROS.1992.587385。

[7] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in Proc. Int. Joint Conf. Neural Netw., Jul. 2011, pp. 2809-2813, doi: 10.1109/IJCNN.2011.6033589.

[7] P. Sermanet 和 Y. LeCun, “基于多尺度卷积网络的交通标志识别,” 载于《国际联合神经网络会议论文集》,2011年7月,第2809-2813页,doi: 10.1109/IJCNN.2011.6033589。

[8] R. Fan, H. Wang, P. Cai, and M. Liu, "SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Jan. 2020, pp. 340-356.

[8] R. Fan, H. Wang, P. Cai 和 M. Liu, “SNE-RoadSeg:将表面法线信息融入语义分割以实现精确的自由空间检测,” 载于《欧洲计算机视觉会议论文集》,瑞士Cham:Springer出版社,2020年1月,第340-356页。

[9] J. He, C. Zhang, X. He, and R. Dong, "Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features," Neurocomputing, vol. 390, pp. 248-259, May 2020.

[9] J. He, C. Zhang, X. He 和 R. Dong, “结合卷积姿态机和手工特征的交通警察手势视觉识别,” 《神经计算》(Neurocomputing), 第390卷,第248-259页,2020年5月。

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.

[10] R. Girshick, J. Donahue, T. Darrell 和 J. Malik, “用于精确目标检测和语义分割的丰富特征层次结构,” 载于《IEEE计算机视觉与模式识别会议论文集》,2014年6月,第580-587页。

[11] G. Vinod and G. Padmapriya, "An adaptable real-time object detection for traffic surveillance using R-CNN over CNN with improved accuracy," in Proc. Int. Conf. Bus. Anal. Technol. Secur. (ICBATS), Feb. 2022, pp. 1-4, doi: 10.1109/ICBATS54253.2022.9759030.

[11] G. Vinod 和 G. Padmapriya, “基于改进准确率的R-CNN结合CNN的适应性实时交通监控目标检测,” 载于《国际商业分析技术与安全会议(ICBATS)论文集》,2022年2月,第1-4页,doi: 10.1109/ICBATS54253.2022.9759030。

[12] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4073-4082, doi: 10.1109/CVPR.2015.7299034.

[12] J. Hosang, M. Omran, R. Benenson 和 B. Schiele, “深入研究行人检测,” 载于《IEEE计算机视觉与模式识别会议(CVPR)论文集》,2015年6月,第4073-4082页,doi: 10.1109/CVPR.2015.7299034。

[13] V. Murugan, V. R. Vijaykumar, and A. Nidhila, "A deep learning RCNN approach for vehicle recognition in traffic surveillance system," in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Apr. 2019, pp. 157-160.

[13] V. Murugan, V. R. Vijaykumar 和 A. Nidhila, “基于深度学习的RCNN方法用于交通监控系统中的车辆识别,” 载于《国际通信与信号处理会议(ICCSP)论文集》,2019年4月,第157-160页。

[14] J. Zhang, Z. Xie, J. Sun, X. Zou, and J. Wang, "A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection," IEEE Access, vol. 8, pp. 29742-29754, 2020.

[14] J. Zhang, Z. Xie, J. Sun, X. Zou 和 J. Wang, “结合多尺度注意力和样本不平衡的级联R-CNN用于交通标志检测,” 《IEEE Access》,第8卷,第29742-29754页,2020年。

[15] J. Cao, J. Zhang, and X. Jin, "A traffic-sign detection algorithm based on improved sparse R-CNN," IEEE Access, vol. 9, pp. 122774-122788, 2021, doi: 10.1109/ACCESS.2021.3109606.

[15] J. Cao, J. Zhang 和 X. Jin, “基于改进稀疏R-CNN的交通标志检测算法,” 《IEEE Access》,第9卷,第122774-122788页,2021年,doi: 10.1109/ACCESS.2021.3109606。

[16] C. Lin, Y. Shi, J. Zhang, C. Xie, W. Chen, and Y. Chen, "An anchor-free detector and R-CNN integrated neural network architecture for environmental perception of urban roads," Proc. Inst. Mech. Eng., D, J. Automobile Eng., vol. 235, no. 12, pp. 2964-2973, Oct. 2021.

[16] C. Lin, Y. Shi, J. Zhang, C. Xie, W. Chen 和 Y. Chen, “一种无锚点检测器与R-CNN集成的神经网络架构用于城市道路环境感知,” 《机械工程师学会学报D辑,汽车工程杂志》,第235卷,第12期,第2964-2973页,2021年10月。

[17] P. Li, Y. He, D. Yin, F. R. Yu, and P. Song, "Bagging R-CNN: Ensemble for object detection in complex traffic scenes," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10097085.

[17] P. Li, Y. He, D. Yin, F. R. Yu 和 P. Song, “Bagging R-CNN:复杂交通场景中的目标检测集成方法,” 载于《IEEE国际声学、语音与信号处理会议(ICASSP)论文集》,2023年6月,第1-5页,doi: 10.1109/ICASSP49357.2023.10097085。

[18] T. Liang, H. Bao, W. Pan, and F. Pan, "Traffic sign detection via improved sparse R-CNN for autonomous vehicles," J. Adv. Transp., vol. 2022, pp. 1-16, Mar. 2022.

[18] T. Liang, H. Bao, W. Pan 和 F. Pan, “基于改进稀疏R-CNN的自动驾驶车辆交通标志检测,” 《先进运输杂志》,2022年,第1-16页,3月。

[19] M. Takahashi, K. Iino, H. Watanabe, I. Morinaga, S. Enomoto, X. Shi, A. Sakamoto, and T. Eda, "Category-based memory bank design for traffic surveillance in context R-CNN," Proc. SPIE, vol. 12592, Mar. 2023, Art. no. 125920G, doi: 10.1117/12.2666991.

[20] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, and P. Luo, "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14449-14458.

[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992-10002.

[22] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440-1448.

[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.

[24] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., vol. 37. Cham, Switzerland: Springer, Jan. 2014, pp. 1904-1916.

[25] R. Qian, Q. Liu, Y. Yue, F. Coenen, and B. Zhang, "Road surface traffic sign detection with hybrid region proposal and fast R-CNN," in Proc. 12th Int. Conf. Natural Comput., Fuzzy Syst. Knowl. Discovery (ICNC-FSKD), Aug. 2016, pp. 555-559, doi: 10.1109/FSKD.2016.7603233.

[26] Z. Zhang, K. Liu, F. Gao, X. Li, and G. Wang, "Vision-based vehicle detecting and counting for traffic flow analysis," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 2267-2273, doi: 10.1109/IJCNN.2016.7727480.

[27] Z. Moayed, A. Griffin, and R. Klette, "Traffic intersection monitoring using fusion of GMM-based deep learning classification and geometric warping," in Proc. Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Dec. 2017, pp. 1-5, doi: 10.1109/IVCNZ.2017.8402465.

[28] X. Li, L. Li, F. Flohr, J. Wang, H. Xiong, M. Bernhard, S. Pan, D. M. Gavrila, and K. Li, "A unified framework for concurrent pedestrian and cyclist detection," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 2, pp. 269-281, Feb. 2017, doi: 10.1109/TITS.2016.2567418.

[29] K. S. Htet and M. M. Sein, "Event analysis for vehicle classification using fast RCNN," in Proc. IEEE 9th Global Conf. Consum. Electron. (GCCE), Oct. 2020, pp. 403-404, doi: 10.1109/GCCE50665.2020.9291978.

[30] A. Ali, O. G. Olaleye, B. Dey, and M. Bayoumi, "Fast deep pyramid DPM object detection with region proposal networks," in Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (ISSPIT), Dec. 2017, pp. 168-173, doi: 10.1109/ISSPIT.2017.8388636.

[31] K. Wang and W. Zhou, "Pedestrian and cyclist detection based on deep neural network fast R-CNN," Int. J. Adv. Robot. Syst., vol. 16, no. 2, Mar. 2019, doi: 10.1177/1729881419829651.

[32] N. Arora, Y. Kumar, R. Karkra, and M. Kumar, "Automatic vehicle detection system in different environment conditions using fast R-CNN," Multimedia Tools Appl., vol. 81, no. 13, pp. 18715-18735, May 2022, doi: 10.1007/s11042-022-12347-8.

[33] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1-14.

[34] C. Guindel, D. Martin, and J. M. Armingol, "Fast joint object detection and viewpoint estimation for traffic scene understanding," IEEE Intell. Transp. Syst. Mag., vol. 10, no. 4, pp. 74-86, Winter 2018.

[35] K. Qiao, H. Gu, J. Liu, and P. Liu, "Optimization of traffic sign detection and classification based on faster R-CNN," in Proc. Int. Conf. Comput. Technol., Electron. Commun. (ICCTEC), Dec. 2017, pp. 608-611, doi: 10.1109/ICCTEC.2017.00137.

[36] G. Wang and X. Ma, "Traffic police gesture recognition using RGB-D and faster R-CNN," in Proc. Int. Conf. Intell. Informat. Biomed. Sci. (ICIIBMS), vol. 3, Oct. 2018, pp. 78-81, doi: 10.1109/ICIIBMS.2018.8549975.

[37] T. Liu and T. Stathaki, "Faster R-CNN for robust pedestrian detection using semantic segmentation network," Frontiers Neurorobotics, vol. 12, p. 64, Oct. 2018, doi: 10.3389/fnbot.2018.00064.

[38] A. Mhalla, T. Chateau, S. Gazzah, and N. E. B. Amara, "An embedded computer-vision system for multi-object detection in traffic surveillance," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 11, pp. 4006-4018, Nov. 2019, doi: 10.1109/TITS.2018.2876614.

[39] M. Zinanyuca and D. Arce, "Traffic parameters acquisition system using faster R-CNN deep learning based algorithm," in Proc. IEEE ANDESCON, Oct. 2020, pp. 1-6, doi: 10.1109/ANDESCON50619.2020.9271996.

[40] X. Gao, L. Chen, K. Wang, X. Xiong, H. Wang, and Y. Li, "Improved traffic sign detection algorithm based on faster R-CNN," Appl. Sci., vol. 12, no. 18, p. 8948, Sep. 2022, doi: 10.3390/app12188948.

[41] Y. Cui and D. Lei, "Optimizing Internet of Things-based intelligent transportation system's information acquisition using deep learning," IEEE Access, vol. 11, pp. 11804-11810, 2023, doi: 10.1109/ACCESS.2023.3242116.

[42] C. Cao, B. Wang, W. Zhang, X. Zeng, X. Yan, Z. Feng, Y. Liu, and Z. Wu, "An improved faster R-CNN for small object detection," IEEE Access, vol. 7, pp. 106838-106846, 2019, doi: 10.1109/ACCESS.2019.2932731.

[43] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot MultiBox detector," in Computer Vision-ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., Cham, Switzerland: Springer, 2016, pp. 21-37.

[44] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779-788.

[45] T. Jin, D. Zhang, F. Ding, Z. Zhang, and M. Zhang, "A vehicle detection algorithm in complex traffic scenes," Proc. SPIE, vol. 11519, Jun. 2020, Art. no. 115190C, doi: 10.1117/12.2573189.

[46] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517-6525.

[47] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," in Proc. Comput. Vis. Pattern Recognit., Jan. 2018. [Online]. Available: https://arxiv.org/abs/1804.02767

[48] Ultralytics. (2020). YOLOv5. [Online]. Available: https://github.com/ultralytics/yolov5

[49] X. Li, Z. Xie, X. Deng, Y. Wu, and Y. Pi, "Traffic sign detection based on improved faster R-CNN for autonomous driving," J. Supercomput., vol. 78, no. 6, pp. 7982-8002, Apr. 2022, doi: 10.1007/s11227-021-04230-4.

[50] R. Hu, H. Li, D. Huang, X. Xu, and K. He, "Traffic sign detection based on driving sight distance in haze environment," IEEE Access, vol. 10, pp. 101124-101136, 2022, doi: 10.1109/ACCESS.2022.3208108.

[51] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980-2988.

[52] S. Sarp, M. Kuzlu, M. Cetin, C. Sazara, and O. Guler, "Detecting floodwater on roadways from image data using mask-R-CNN," in Proc. Int. Conf. Innov. Intell. Syst. Appl. (INISTA), Aug. 2020, pp. 1-6, doi: 10.1109/INISTA49547.2020.9194655.

[53] E. H.-C. Lu, M. Gozdzikiewicz, K.-H. Chang, and J.-M. Ciou, "A hierarchical approach for traffic sign recognition based on shape detection and image classification," Sensors, vol. 22, no. 13, p. 4768, Jun. 2022, doi: 10.3390/s22134768.

[54] D. He, Y. Qiu, J. Miao, Z. Zou, K. Li, C. Ren, and G. Shen, "Improved mask R-CNN for obstacle detection of rail transit," Measurement, vol. 190, Feb. 2022, Art. no. 110728.

[55] L. Lou, Q. Zhang, C. Liu, M. Sheng, J. Liu, and H. Song, "Detecting and counting the moving vehicles using mask R-CNN," in Proc. IEEE 8th Data Driven Control Learn. Syst. Conf. (DDCLS), May 2019, pp. 987-992.

[56] E. J. Piedad, T.-T. Le, K. Aying, F. K. Pama, and I. Tabale, "Vehicle count system based on time interval image capture method and deep learning mask R-CNN," in Proc. IEEE Region 10 Conf. (TENCON), Oct. 2019, pp. 2675-2679.

[57] H. Tahir, M. Shahbaz Khan, and M. Owais Tariq, "Performance analysis and comparison of faster R-CNN, mask R-CNN and ResNet50 for the detection and counting of vehicles," in Proc. Int. Conf. Comput., Commun., Intell. Syst. (ICCCIS), Feb. 2021, pp. 587-594, doi: 10.1109/icccis51004.2021.9397079.

[58] C. Sazara, M. Cetin, and K. Iftekharuddin, "Image dataset for roadway flooding," Mendeley Data, Amsterdam, The Netherlands, Tech. Rep. V1, 2019. Accessed: Aug. 15, 2024.

[59] C. Sazara, M. Cetin, and K. M. Iftekharuddin, "Detecting floodwater on roadways from image data with handcrafted features and deep transfer learning," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 804-809, doi: 10.1109/ITSC.2019.8917368.

[60] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1800-1807.

[61] G. Jocher. (May 2020). YOLOv5 by Ultralytics. [Online]. Available: https://github.com/ultralytics/

[62] J.-P. Lin and M.-T. Sun, "A YOLO-based traffic counting system," in Proc. Conf. Technol. Appl. Artif. Intell. (TAAI), Nov. 2018, pp. 82-85, doi: 10.1109/TAAI.2018.00027.

[63] M. B. Jensen, K. Nasrollahi, and T. B. Moeslund, "Evaluating state-of-the-art object detector on challenging traffic light data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 882-888, doi: 10.1109/CVPRW.2017.122.

[64] S. P. Rajendran, L. Shine, R. Pradeep, and S. Vijayaraghavan, "Real-time traffic sign recognition using YOLOv3 based detector," in Proc. 10th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), Jul. 2019, pp. 1-7, doi: 10.1109/ICCCNT45670.2019.8944890.

[65] J. Yu, X. Ye, and Q. Tu, "Traffic sign detection and recognition in multiimages using a fusion model with YOLO and VGG network," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 16632-16642, Sep. 2022, doi: 10.1109/TITS.2022.3170354.

[66] Z. Yang, J. Li, and H. Li, "Real-time pedestrian and vehicle detection for autonomous driving," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 179-184, doi: 10.1109/IVS.2018.8500642.

[67] A. Corovic, V. Ilic, S. Duric, M. Marijan, and B. Pavkovic, "The real-time detection of traffic participants using YOLO algorithm," in Proc. 26th Telecommun. Forum (TELFOR), Nov. 2018, pp. 1-4, doi: 10.1109/TELFOR.2018.8611986.

[68] W. Song and S. A. Suandi, "TSR-YOLO: A Chinese traffic sign recognition algorithm for intelligent vehicles in complex scenes," Sensors, vol. 23, no. 2, p. 749, Jan. 2023, doi: 10.3390/s23020749.

[69] L. Xiaomeng, F. Jun, and C. Peng, "Vehicle detection in traffic monitoring scenes based on improved YOLOV5s," in Proc. Int. Conf. Comput. Eng. Artif. Intell. (ICCEAI), Jul. 2022, pp. 467-471, doi: 10.1109/ICCEAI55464.2022.00103.

[70] S. Zhang, S. Che, Z. Liu, and X. Zhang, "A real-time and lightweight traffic sign detection method based on ghost-YOLO," Multimedia Tools Appl., vol. 82, no. 17, pp. 26063-26087, Jul. 2023, doi: 10.1007/s11042-023-14342-z.

[71] C. Sinthia and Md. H. Kabir, "Detection and recognition of Bangladeshi vehicles' nameplates using YOLOV6 and BLPNET," in Proc. Int. Conf. Electr., Comput. Commun. Eng. (ECCE), Feb. 2023, pp. 1-6.

[72] T. Suwattanapunkul and L.-J. Wang, "The efficient traffic sign detection and recognition for Taiwan road using YOLO model with hybrid dataset," in Proc. 9th Int. Conf. Appl. Syst. Innov. (ICASI), Apr. 2023, pp. 160-162, doi: 10.1109/ICASI57738.2023.10179493.

[73] D. Shokri, C. Larouche, and S. Homayouni, "A comparative analysis of multi-label deep learning classifiers for real-time vehicle detection to support intelligent transportation systems," Smart Cities, vol. 6, no. 5, pp. 2982-3004, Oct. 2023.

[74] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.

[75] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.

[76] C. Dewi, R.-C. Chen, X. Jiang, and H. Yu, "Deep convolutional neural network for enhancing traffic sign recognition developed on YOLO V4," Multimedia Tools Appl., vol. 81, no. 26, pp. 37821-37845, Nov. 2022, doi: 10.1007/s11042-022-12962-5.

[77] A. Gomaa and A. Abdalrazik, "Novel deep learning domain adaptation approach for object detection using semi-self building dataset and modified YOLOv4," World Electr. Vehicle J., vol. 15, no. 6, p. 255, Jun. 2024.

[78] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, "RepVGG: Making VGG-style ConvNets great again," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13728-13737.

[79] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie, Y. Li, B. Zhang, Y. Liang, L. Zhou, X. Xu, X. Chu, X. Wei, and X. Wei, "YOLOv6: A single-stage object detection framework for industrial applications," 2022, arXiv:2209.02976.

[80] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Los Alamitos, CA, USA, Jun. 2023, pp. 7464-7475.

[81] H. Zhang, Y. Ruan, A. Huo, and X. Jiang, "Traffic sign detection based on improved YOLOv7," in Proc. 5th Int. Conf. Intell. Control, Meas. Signal Process. (ICMSP), May 2023, pp. 71-75, doi: 10.1109/ICMSP58539.2023.10170868.

[82] L. Kantorovitch, "On the translocation of masses," Manage. Sci., vol. 5, no. 1, pp. 1-4, Oct. 1958.

[83] G. Jocher, A. Chaurasia, and J. Qiu. (2023). Ultralytics YOLOv8.

[84] A. Ammar, A. Koubaa, M. Ahmed, A. Saad, and B. Benjdira, "Vehicle detection from aerial images using deep learning: A comparative study," Electronics, vol. 10, no. 7, p. 820, Mar. 2021.

[85] H. Zunair, S. Khan, and A. Ben Hamza, "RSUD20K: A dataset for road scene understanding in autonomous driving," 2024, arXiv:2401.07322.

[86] Z. Wang, S. Yang, H. Qin, Y. Liu, and J. Ding, "CCW-YOLO: A modified YOLOv5s network for pedestrian detection in complex traffic scenes," Information, vol. 15, no. 12, p. 762, Dec. 2024.

[87] Z. Chen, K. Yang, Y. Wu, H. Yang, and X. Tang, "HCLT-YOLO: A hybrid CNN and lightweight transformer architecture for object detection in complex traffic scenes," IEEE Trans. Veh. Technol., early access, Nov. 12, 2024, doi: 10.1109/TVT.2024.3496513.

[88] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.

[89] A. Abdelraouf, M. Abdel-Aty, and Y. Wu, "Using vision transformers for spatial-context-aware rain and road surface condition detection on freeways," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 18546-18556, Oct. 2022.

[90] S. Zhao, H. Li, Q. Ke, L. Liu, and R. Zhang, "Action-ViT: Pedestrian intent prediction in traffic scenes," IEEE Signal Process. Lett., vol. 29, pp. 324-328, 2022.

[91] M. Kang, W. Lee, K. Hwang, and Y. Yoon, "Vision transformer for detecting critical situations and extracting functional scenario for automated vehicle safety assessment," Sustainability, vol. 14, no. 15, p. 9680, Aug. 2022.

[92] J. Wurst, L. Balasubramanian, M. Botsch, and W. Utschick, "Novelty detection and analysis of traffic scenario infrastructures in the latent space of a vision transformer-based triplet autoencoder," in Proc. IEEE Intell. Vehicles Symp. (IV), Jul. 2021, pp. 1304-1311.

[93] J. Wurst, A. F. Fernández, M. Botsch, and W. Utschick, "An entropy based outlier score and its application to novelty detection for road infrastructure images," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 1436-1443.

[94] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 213-229.

[95] J. Xia, M. Li, W. Liu, and X. Chen, "DSRA-DETR: An improved DETR for multiscale traffic sign detection," Sustainability, vol. 15, no. 14, p. 10862, Jul. 2023.

[96] H. Wei, Q. Zhang, Y. Qian, Z. Xu, and J. Han, "MTSDet: Multi-scale traffic sign detection with attention and path aggregation," Appl. Intell., vol. 53, no. 1, pp. 238-250, Jan. 2023.

[97] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, "Fast convergence of DETR with spatially modulated co-attention," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3601-3610.

[98] T. Liang, H. Bao, W. Pan, X. Fan, and H. Li, "DetectFormer: Category-assisted transformer for traffic scene object detection," Sensors, vol. 22, no. 13, p. 4833, Jun. 2022.

[99] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," 2016, arXiv:1609.02907.

[100] S. Mylavarapu, M. Sandhu, P. Vijayan, K. M. Krishna, B. Ravindran, and A. Namboodiri, "Towards accurate vehicle behaviour classification with multi-relational graph convolutional networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 321-327.

[101] K. Liu, Y. Zheng, J. Yang, H. Bao, and H. Zeng, "Chinese traffic police gesture recognition based on graph convolutional network in natural scene," Appl. Sci., vol. 11, no. 24, p. 11951, Dec. 2021.

[102] Z. Fang, W. Zhang, Z. Guo, R. Zhi, B. Wang, and F. Flohr, "Traffic police gesture recognition by pose graph convolutional networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Oct. 2020, pp. 1833-1838.

[103] J. Lian, Z. Wang, L. Li, and Y. Zhou, "The understanding of traffic police intention based on visual awareness," Neural Process. Lett., vol. 54, no. 4, pp. 2843-2859, Aug. 2022.

[104] F. Xu, F. Xu, J. Xie, C.-M. Pun, H. Lu, and H. Gao, "Action recognition framework in traffic scene for autonomous driving system," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 22301-22311, Nov. 2022.

[105] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172-186, Jan. 2021.

[106] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lió, and Y. Bengio, "Graph attention networks," 2017, arXiv:1710.10903.

[107] P. N. Chowdhury, P. Shivakumara, S. Kanchan, R. Raghavendra, U. Pal, T. Lu, and D. Lopresti, "Graph attention network for detecting license plates in crowded street scenes," Pattern Recognit. Lett., vol. 140, pp. 18-25, Dec. 2020, doi: 10.1016/j.patrec.2020.09.018.

[108] Z. Wang, Z. Li, J. Leng, M. Li, and L. Bai, "Multiple pedestrian tracking with graph attention map on urban road scene," IEEE Trans. Intell. Transp. Syst., vol. 24, no. 8, pp. 8567-8579, Aug. 2023.

[109] T. Monninger, J. Schmidt, J. Rupprecht, D. Raba, J. Jordan, D. Frank, S. Staab, and K. Dietmayer, "SCENE: Reasoning about traffic scenes using heterogeneous graph neural networks," IEEE Robot. Autom. Lett., vol. 8, no. 3, pp. 1531-1538, Mar. 2023.

[110] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" 2018, arXiv:1810.00826.

[111] A. V. Malawade, S.-Y. Yu, B. Hsu, H. Kaeley, A. Karra, and M. A. A. Faruque, "roadscene2vec: A tool for extracting and embedding road scene-graphs," Knowl.-Based Syst., vol. 242, Apr. 2022, Art. no. 108245.

[112] G. A. Noghre, V. Katariya, A. D. Pazho, C. Neff, and H. Tabkhi, "Pishgu: Universal path prediction network architecture for real-time cyber-physical edge systems," 2022, arXiv:2210.08057.

[113] Y. Tian, A. Carballo, R. Li, and K. Takeda, "RSG-search: Semantic traffic scene retrieval using graph-based scene representation," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2023, pp. 1-8.

[114] J. Wurst, L. Balasubramanian, M. Botsch, and W. Utschick, "Expert-LaSTS: Expert-knowledge guided latent space for traffic scenarios," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2022, pp. 484-491.

[115] M. Mendieta and H. Tabkhi, "CARPe posterum: A convolutional approach for real-time pedestrian path prediction," in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 3, pp. 2346-2354.

[116] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 3859-3869.

[117] A. Dinesh Kumar, "Novel deep learning model for traffic sign detection using capsule networks," 2018, arXiv:1805.04424.

[118] X. Liu, W. Q. Yan, and N. Kasabov, "Vehicle-related scene segmentation using CapsNets," in Proc. 35th Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Nov. 2020, pp. 1-6, doi: 10.1109/IVCNZ51579.2020.9290664.

[119] Z. Hao, "The method of recognizing traffic signs based on the improved capsule network," in Proc. Int. Conf. Comput. Eng. Intell. Control (ICCEIC), Nov. 2020, pp. 22-26.

[120] X. Liu and W. Q. Yan, "Traffic-light sign recognition using capsule network," Multimedia Tools Appl., vol. 80, no. 10, pp. 15161-15171, Apr. 2021, doi: 10.1007/s11042-020-10455-x.

[121] W. Yang and W. Zhang, "Real-time traffic signs detection based on YOLO network model," in Proc. Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Discovery (CyberC), Oct. 2020, pp. 354-357, doi: 10.1109/CyberC49757.2020.00066.

[122] Y. Liu, G. Shi, Y. Li, and Z. Zhao, "M-YOLO: Traffic sign detection algorithm applicable to complex scenarios," Symmetry, vol. 14, no. 5, p. 952, May 2022, doi: 10.3390/sym14050952.

[123] C. Dewi, R.-C. Chen, Y.-T. Liu, X. Jiang, and K. D. Hartomo, "YOLO V4 for advanced traffic sign recognition with synthetic training data generated by various GAN," IEEE Access, vol. 9, pp. 97228-97242, 2021, doi: 10.1109/ACCESS.2021.3094201.

[124] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, Eds., Red Hook, NY, USA: Curran Associates, 2014.

[125] K. Zhang, X. Feng, N. Jia, L. Zhao, and Z. He, "TSR-GAN: Generative adversarial networks for traffic state reconstruction with time space diagrams," Phys. A, Stat. Mech. Appl., vol. 591, Apr. 2022, Art. no. 126788.

[126] P. König, S. Aigner, and M. Körner, "Enhancing traffic scene predictions with generative adversarial networks," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 1768-1775.

[127] Y. Cai, L. Dai, H. Wang, and Z. Li, "Multi-target pan-class intrinsic relevance driven model for improving semantic segmentation in autonomous driving," IEEE Trans. Image Process., vol. 30, pp. 9069-9084, 2021.

[128] W. Xu, N. Souly, and P. P. Brahma, "Reliability of GAN generated data to train and validate perception systems for autonomous vehicles," in Proc. IEEE Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2021, pp. 171-180.

[129] M. Uricár, G. Sistu, H. Rashed, A. Vobecký, V. R. Kumar, P. Krížek, F. Bürger, and S. Yogamani, "Let's get dirty: GAN based data augmentation for camera lens soiling detection in autonomous driving," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 766-775.

[130] X. Cheng, J. Zhou, J. Song, and X. Zhao, "A highway traffic image enhancement algorithm based on improved GAN in complex weather conditions," IEEE Trans. Intell. Transp. Syst., vol. 24, no. 8, pp. 8716-8726, Aug. 2023.

[131] C. Jiqing, W. Depeng, L. Teng, L. Tian, and W. Huabin, "All-weather road drivable area segmentation method based on CycleGAN," Vis. Comput., vol. 39, no. 10, pp. 5135-5151, Oct. 2023.

[132] A. Mukherjee, A. Joshi, C. Hegde, and S. Sarkar, "Semantic domain adaptation for deep classifiers via gan-based data augmentation," in Proc. Conf. Neural Inf. Process. Syst. Workshops, 2019, pp. 1-7.

[133] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, "SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1349-1358.

[134] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105-114.

[135] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, "DeblurGAN: Blind motion deblurring using conditional adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8183-8192.

[136] S. Aigner and M. Körner, "FutureGAN: Anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing GANs," 2018, arXiv:1810.01325.

[137] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2242-2251.

[138] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, "AttGAN: Facial attribute editing by only changing what you want," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464-5478, Nov. 2019.

[139] F. Lateef, M. Kas, A. Chahi, and Y. Ruichek, "A two-stream conditional generative adversarial network for improving semantic predictions in urban driving scenes," Eng. Appl. Artif. Intell., vol. 133, Jul. 2024, Art. no. 108290.

[140] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.

[141] L. Gou, L. Zou, N. Li, M. Hofmann, A. K. Shekar, A. Wendt, and L. Ren, "VATLD: A visual analytics system to assess, understand and improve traffic light detection," IEEE Trans. Vis. Comput. Graph., vol. 27, no. 2, pp. 261-271, Feb. 2021.

[142] Z. Chen and L. Liu, "NSS-VAEs: Generative scene decomposition for visual navigable space construction," 2021, arXiv:2111.01127.

[143] V. K. Sundar, S. Ramakrishna, Z. Rahiminasab, A. Easwaran, and A. Dubey, "Out-of-distribution detection in multi-label datasets using latent space of B-VAE," in Proc. IEEE Secur. Privacy Workshops (SPW), May 2020, pp. 250-255.

[144] S. Tan, K. Wong, S. Wang, S. Manivasagam, M. Ren, and R. Urtasun, "SceneGen: Learning to generate realistic traffic scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 892-901.

[145] W. Ding, H. Lin, B. Li, and D. Zhao, "Semantically adversarial scenario generation with explicit knowledge guidance," 2021, arXiv:2106.04066.

[146] N. Aslam and M. H. Kolekar, "A-VAE: Attention based variational autoencoder for traffic video anomaly detection," in Proc. IEEE 8th Int. Conf. Converg. Technol. (I2CT), Apr. 2023, pp. 1-7.

[147] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner, "Understanding disentangling in β-VAE," 2018, arXiv:1804.03599.

[148] Z. Li, C. Zhang, G. Meng, and Y. Liu, "Joint haze image synthesis and dehazing with mmd-vae losses," 2019, arXiv:1905.05947.

[149] Q. Tian and J. Sun, "Cluster-based dual-branch contrastive learning for unsupervised domain adaptation person re-identification," Knowl.-Based Syst., vol. 280, Nov. 2023, Art. no. 111026.

[150] X. Gao, Z. Chen, J. Wei, R. Wang, and Z. Zhao, "Deep mutual distillation for unsupervised domain adaptation person re-identification," IEEE Trans. Multimedia, early access, Sep. 12, 2024, doi: 10.1109/TMM.2024.3459637.

[151] G. Mattolin, L. Zanella, E. Ricci, and Y. Wang, "ConfMix: Unsupervised domain adaptation for object detection via confidence-based mixing," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 423-433.

[152] D. Shenaj, E. Fanì, M. Toldo, D. Caldarola, A. Tavera, U. Michieli, M. Ciccone, P. Zanuttigh, and B. Caputo, "Learning across domains and devices: Style-driven source-free domain adaptation in clustered federated learning," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 444-454.

[153] Y. Zheng, D. Huang, S. Liu, and Y. Wang, "Cross-domain object detection through coarse-to-fine feature adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13763-13772.

[154] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, "Big self-supervised models are strong semi-supervised learners," in Proc. Adv. Neural Inf. Process. Syst., Jan. 2020, pp. 22243-22255.

[155] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, "A simple framework for contrastive learning of visual representations," in Proc. 37th Int. Conf. Mach. Learn., Jan. 2020, pp. 1597-1607.

[156] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.

[157] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79-86, Mar. 1951.

[158] A. Gretton, K. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, no. 1, pp. 723-773, Mar. 2012.

[159] Z. Zhao, S. Wei, Q. Chen, D. Li, Y. Yang, Y. Peng, and Y. Liu, "Masked retraining teacher-student framework for domain adaptive object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 18993-19003.

[160] K. Gong, S. Li, S. Li, R. Zhang, C. H. Liu, and Q. Chen, "Improving transferability for domain adaptive detection transformers," in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 1543-1551.

[161] G. Li, Z. Ji, Y. Chang, S. Li, X. Qu, and D. Cao, "ML-ANet: A transfer learning approach using adaptation network for multi-label image classification in autonomous driving," Chin. J. Mech. Eng., vol. 34, no. 1, p. 78, Dec. 2021.

[162] D. Mekhazni, A. Bhuiyan, G. Ekladious, and E. Granger, "Unsupervised domain adaptation in the dissimilarity space for person re-identification," in Computer Vision-ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Cham, Switzerland: Springer, 2020, pp. 159-174.

[163] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht, "Sliced Wasserstein discrepancy for unsupervised domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10277-10287.

[164] A.-D. Doan, B. L. Nguyen, S. Gupta, I. Reid, M. Wagner, and T.-J. Chin, "Assessing domain gap for continual domain adaptation in object detection," Comput. Vis. Image Understand., vol. 238, Jan. 2024, Art. no. 103885.

[165] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," 2016, arXiv:1611.07004.

[166] M.-Y. Liu, T. M. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 700-708.

[167] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, "Image to image translation for domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4500-4509.

[168] J. Lee, D. Shiotsuka, G. Bang, Y. Endo, T. Nishimori, K. Nakao, and S. Kamijo, "Day-to-night image translation via transfer learning to keep semantic information for driving simulator," IATSS Res., vol. 47, no. 2, pp. 251-262, Jul. 2023.

[169] D. Kothandaraman, A. Nambiar, and A. Mittal, "Domain adaptive knowledge distillation for driving scene semantic segmentation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. Workshops (WACVW), Jan. 2021, pp. 134-143.

[170] H. Wang, S. Liao, and L. Shao, "AFAN: Augmented feature alignment network for cross-domain object detection," IEEE Trans. Image Process., vol. 30, pp. 4046-4056, 2021.

[171] J. Li, R. Xu, X. Liu, J. Ma, B. Li, Q. Zou, J. Ma, and H. Yu, "Domain adaptation based object detection for autonomous driving in foggy and rainy weather," 2023, arXiv:2307.09676.

[172] Y. Guo, R. Liang, Y. Cui, X. Zhao, and Q. Meng, "A domain-adaptive method with cycle perceptual consistency adversarial networks for vehicle target detection in foggy weather," IET Intell. Transp. Syst., vol. 16, no. 7, pp. 971-981, Jul. 2022.

[173] X. Yu and X. Lu, "Domain adaptation of anchor-free object detection for urban traffic," Neurocomputing, vol. 582, May 2024, Art. no. 127477.

[174] M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh, "Unsupervised domain adaptation for semantic segmentation of urban scenes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, pp. 1211-1220.

[175] M. Saffari, M. Khodayar, and S. M. J. Jalali, "Sparse adversarial unsupervised domain adaptation with deep dictionary learning for traffic scene classification," IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 4, pp. 1139-1150, Apr. 2023.

[176] M. Saffari and M. Khodayar, "Low-rank sparse generative adversarial unsupervised domain adaptation for multitarget traffic scene semantic segmentation," IEEE Trans. Ind. Informat., vol. 20, no. 2, pp. 2564-2576, Feb. 2024.

[177] H. Zhang, G. Luo, J. Li, and F.-Y. Wang, "C2FDA: Coarse-to-fine domain adaptation for traffic object detection," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 8, pp. 12633-12647, Aug. 2022.

[178] Q. Zhou, Q. Gu, J. Pang, X. Lu, and L. Ma, "Self-adversarial disentangling for specific domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7, pp. 8954-8968, Jul. 2023.

[179] J. Wang, T. Shen, Y. Tian, Y. Wang, C. Gou, X. Wang, F. Yao, and C. Sun, "A parallel teacher for synthetic-to-real domain adaptation of traffic object detection," IEEE Trans. Intell. Vehicles, vol. 7, no. 3, pp. 441-455, Sep. 2022.

[180] L. Zhang, P. Ratsamee, B. Wang, Z. Luo, Y. Uranishi, M. Higashida, and H. Takemura, "Panoptic-aware image-to-image translation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023, pp. 259-268.

[181] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," in Proc. Int. Conf. Mach. Learn., Jan. 2017, pp. 1989-1998.

[182] G. Bang, J. Lee, Y. Endo, T. Nishimori, K. Nakao, and S. Kamijo, "Semantic and geometric-aware day-to-night image translation network," Sensors, vol. 24, no. 4, p. 1339, Feb. 2024.

[183] T.-D. Truong, N. Le, B. Raj, J. Cothren, and K. Luu, "FREDOM: Fairness domain adaptation approach to semantic scene understanding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 19988-19997.

[184] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.

[185] A. Cherian and A. Sullivan, "Sem-GAN: Semantically-consistent image-to-image translation," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2019, pp. 1797-1806.

[186] S.-W. Huang, C.-T. Lin, S. Chen, Y.-Y. Wu, P.-H. Hsu, and S. Lai, "AugGAN: Cross domain adaptation with GAN-based data augmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Jan. 2018, pp. 731-744.

[187] R. Volpi, P. Morerio, S. Savarese, and V. Murino, "Adversarial feature augmentation for unsupervised domain adaptation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5495-5504.

[188] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017, arXiv:1706.03762.

[189] M. Salem, A. Gomaa, and N. Tsurusaki, "Detection of earthquake-induced building damages using remote sensing data and deep learning: A case study of Mashiki Town, Japan," in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2023, pp. 2350-2353.

[190] A. Gomaa, M. M. Abdelwahab, and M. Abo-Zahhad, "Real-time algorithm for simultaneous vehicle detection and tracking in aerial view videos," in Proc. IEEE 61st Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2018, pp. 222-225.

[191] M. A. Khan and H. Park, "Exploring explainable artificial intelligence techniques for interpretable neural networks in traffic sign recognition systems," Electronics, vol. 13, no. 2, p. 306, Jan. 2024.

[192] C. Bustos, D. Rhoads, A. Solé-Ribalta, D. Masip, A. Arenas, A. Lapedriza, and J. Borge-Holthoefer, "Explainable, automated urban interventions to improve pedestrian and vehicle safety," Transp. Res. C, Emerg. Technol., vol. 125, Apr. 2021, Art. no. 103018.

[193] S. Kolekar, S. Gite, B. Pradhan, and A. Alamri, "Explainable AI in scene understanding for autonomous vehicles in unstructured traffic environments on Indian roads using the Inception U-Net model with Grad-CAM visualization," Sensors, vol. 22, no. 24, p. 9677, Dec. 2022.

[194] J. Dong, S. Chen, M. Miralinaghi, T. Chen, P. Li, and S. Labi, "Why did the AI make that decision? Towards an explainable artificial intelligence (XAI) for autonomous driving systems," Transp. Res. C, Emerg. Technol., vol. 156, Nov. 2023, Art. no. 104358.

[195] K. Han, Y. Wang, J. Guo, Y. Tang, and E. Wu, "Vision GNN: An image is worth graph of nodes," in Proc. Adv. Neural Inf. Process. Syst., A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., Jan. 2022, pp. 8291-8303.

[196] J. Regan and M. Khodayar, "A triplet graph convolutional network with attention and similarity-driven dictionary learning for remote sensing image retrieval," Expert Syst. Appl., vol. 232, Dec. 2023, Art. no. 120579.

[197] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87-110, Jan. 2023.

[198] G. Li, M. Müller, A. Thabet, and B. Ghanem, "DeepGCNs: Can GCNs go as deep as CNNs?" in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 9266-9275.

[199] T. K. Rusch, M. M. Bronstein, and S. Mishra, "A survey on oversmoothing in graph neural networks," 2023, arXiv:2303.10993.

[200] J. Li, Q. Zhang, W. Liu, A. B. Chan, and Y.-G. Fu, "Another perspective of over-smoothing: Alleviating semantic over-smoothing in deep GNNs," IEEE Trans. Neural Netw. Learn. Syst., early access, May 29, 2024, doi: 10.1109/TNNLS.2024.3402317.

[201] L. J. Zhang, J. J. Fang, Y. X. Liu, H. Feng Le, Z. Q. Rao, and J. X. Zhao, "CR-YOLOv8: Multiscale object detection in traffic sign images," IEEE Access, vol. 12, pp. 219-228, 2024.

[202] S. R. Dubey and S. K. Singh, "Transformer-based generative adversarial networks in computer vision: A comprehensive survey," IEEE Trans. Artif. Intell., vol. 5, no. 10, pp. 4851-4867, Oct. 2024.

[203] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3234-3243.

[204] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Proc. Eur. Conf. Comput. Vis. (ECCV), in Lecture Notes in Computer Science, vol. 9906, 2016, pp. 102-118.

[205] R. Zhang, K. Xiong, H. Du, D. Niyato, J. Kang, X. Shen, and H. V. Poor, "Generative AI-enabled vehicular networks: Fundamentals, framework, and case study," IEEE Netw., vol. 38, no. 4, pp. 259-267, Jul. 2024.

[206] E. Galazka, T. T. Niemirepo, and J. Vanne, "CiThruS2: Open-source photorealistic 3D framework for driving and traffic simulation in real time," in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), Sep. 2021, pp. 3284-3291.

[207] X. Li, J. Park, C. Reberg-Horton, S. Mirsky, E. Lobaton, and L. Xiang, "Photorealistic arm robot simulation for 3D plant reconstruction and automatic annotation using Unreal Engine 5," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2024, pp. 5480-5488.

[208] E. Yurtsever, D. Yang, I. M. Koc, and K. A. Redmill, "Photorealism in driving simulations: Blending generative adversarial image synthesis with rendering," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 23114-23123, Dec. 2022.

[209] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, "A survey on multimodal large language models," 2023, arXiv:2306.13549.

[210] H. Wang, J. Qin, A. Bastola, X. Chen, J. Suchanek, Z. Gong, and A. Razi, "VisionGPT: LLM-assisted real-time anomaly detection for safe visual navigation," 2024, arXiv:2403.12415.

[211] T.-A. To, M.-N. Tran, T.-B. Ho, T.-L. Ha, Q.-T. Nguyen, H.-C. Luong, T.-D. Cao, and M.-T. Tran, "Multi-perspective traffic video description model with fine-grained refinement approach," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2024, pp. 7075-7084.

[212] C. Cui, Y. Ma, X. Cao, W. Ye, and Z. Wang, "Receive, reason, and react: Drive as you say, with large language models in autonomous vehicles," IEEE Intell. Transp. Syst. Mag., vol. 16, no. 4, pp. 81-94, Jul. 2024.

[213] L. Kong, X. Xu, J. Ren, W. Zhang, L. Pan, K. Chen, W. T. Ooi, and Z. Liu, "Multi-modal data-efficient 3D scene understanding for autonomous driving," 2024, arXiv:2405.05258.

[214] K. Dasgupta, A. Das, S. Das, U. Bhattacharya, and S. Yogamani, "Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 15940-15950, Sep. 2022.

[215] T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y. Jiang, "NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario," in Proc. AAAI Conf. Artif. Intell., 2024, vol. 38, no. 5, pp. 4542-4550.

[216] J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, "RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing," IEEE Trans. Intell. Vehicles, vol. 9, no. 7, pp. 5163-5172, Jul. 2024.

[217] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Gläser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341-1360, Mar. 2021.

PARYA DOLATYABI (Graduate Student Member, IEEE) received the B.Sc. degree in computer science from the Shahid Bahonar University of Kerman, Kerman, Iran, in 2003, the M.Sc. degree in information technology engineering from the K. N. Toosi University of Technology, Tehran, Iran, in 2007, and the M.Sc. degree in computer engineering from the University of Tulsa (TU), Tulsa, OK, USA, in 2024, where she is currently pursuing the Ph.D. degree in computer science. In 2023, she completed an internship as a Research Assistant with the Laureate Institute for Brain Research (LIBR), Tulsa. Her primary research interests include the theories and applications of deep learning models in computer vision and computational neuroscience. Additionally, she serves as a Reviewer for the IEEE Transactions on Transportation Electrification and Sustainable Computing: Informatics and Systems journals.

JACOB REGAN (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in computer science from the University of Tulsa (TU), Tulsa, Oklahoma, in 2021 and 2022, respectively, where he is currently pursuing the Ph.D. degree in computer science. His main research interests include artificial intelligence, machine learning, computer vision, and transportation network simulation and optimization.

MAHDI KHODAYAR (Member, IEEE) received the B.Sc. degree in computer engineering and the M.Sc. degree in artificial intelligence from the K. N. Toosi University of Technology, Tehran, Iran, in 2013 and 2015, respectively, and the Ph.D. degree in electrical engineering from Southern Methodist University, Dallas, TX, USA, in 2020. In 2017, he was a Research Assistant with the College of Computer and Information Science, Northeastern University, Boston, MA, USA. He is currently an Assistant Professor with the Department of Computer Science, The University of Tulsa, Tulsa, OK, USA. His main research interests include machine learning and statistical pattern recognition. He is focused on DL, sparse modeling, and spatiotemporal pattern recognition. He has served as a Reviewer for many reputable journals, including IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Industrial Informatics, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Sustainable Energy, and IEEE Transactions on Power Systems. Additionally, he serves as an Editor for IEEE Transactions on Transportation Electrification.
